A critical challenge in training large language models (LLMs) for reasoning tasks is identifying the most compute-efficient way to generate synthetic data that improves model performance. Traditionally, stronger but more expensive language models (SE models) have been relied upon to produce high-quality synthetic data for fine-tuning. However, this approach is resource-intensive and limits how much data can be generated within a fixed compute budget. The central question is whether weaker but cheaper models (WC models) can generate data that, despite its lower quality, yields comparable or better training outcomes under the same computational constraints.
Current methods for improving LLM reasoning capabilities include knowledge distillation, where a smaller model learns from a larger one, and self-improvement, where a model is trained on data it generates itself. These methods are effective but carry significant drawbacks: the high computational cost of sampling from large models limits the volume and diversity of the data produced, which in turn constrains the coverage and effectiveness of training. This motivates a reassessment of whether WC models could offer a more compute-efficient way to generate synthetic data for training LLMs.
The researchers from Google DeepMind introduce a novel approach that challenges the reliance on SE models for synthetic data generation. They advocate for using WC models, which, despite their lower quality, are more cost-effective and enable the generation of larger data volumes within the same computing budget. This strategy is evaluated across key metrics: coverage, diversity, and false positive rate (FPR). The findings show that WC-generated data, despite a higher FPR, offers greater coverage and diversity compared to SE-generated data. The study also introduces a weak-to-strong improvement paradigm, where a stronger model is enhanced using data generated by a weaker one. Tested across various fine-tuning setups such as knowledge distillation and self-improvement, this method consistently outperforms traditional approaches. This shift in methodology suggests that WC models can provide a more compute-efficient strategy for developing advanced LLM reasoners.
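For readers who want a concrete sense of these metrics, the sketch below shows one way coverage, diversity, and false positive rate could be computed over sampled solutions. The data structures and the soundness check are hypothetical illustrations under common definitions (coverage as the share of problems solved at least once, diversity as correct solutions per problem, FPR as correct final answers with flawed reasoning), not the paper's actual evaluation code.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class ProblemSamples:
    # Hypothetical container for the candidate solutions sampled for one problem.
    answer_correct: List[bool]    # final answer matches the reference answer
    reasoning_sound: List[bool]   # reasoning also judged valid (e.g., by a human or LLM check)

def coverage(problems: List[ProblemSamples]) -> float:
    """Fraction of problems with at least one correct sampled solution."""
    return sum(any(p.answer_correct) for p in problems) / len(problems)

def diversity(problems: List[ProblemSamples]) -> float:
    """Average number of correct solutions per problem (deduplication omitted for brevity)."""
    return sum(sum(p.answer_correct) for p in problems) / len(problems)

def false_positive_rate(problems: List[ProblemSamples]) -> float:
    """Share of correct-answer solutions whose reasoning is actually flawed."""
    total_correct = sum(sum(p.answer_correct) for p in problems)
    flawed = sum(
        sum(c and not s for c, s in zip(p.answer_correct, p.reasoning_sound))
        for p in problems
    )
    return flawed / total_correct if total_correct else 0.0
```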
The technical details involve a comparative analysis of SE and WC models under a fixed compute budget. Experiments were conducted with the Gemma2 family of models on datasets such as MATH and GSM-8K, with Gemma2-9B and Gemma2-27B serving as the WC and SE models, respectively. Synthetic data was generated under two sampling budgets (low and high), with the WC model producing three times as many samples as the SE model within the same compute constraints. The resulting data was evaluated on coverage, diversity, and FPR. Notably, WC-generated data showed 11% higher coverage and 86% higher diversity than SE-generated data on the MATH dataset, at the cost of a 7% higher FPR. These results highlight the potential of WC models to produce more diverse and comprehensive training data, despite their inherent limitations.
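The three-to-one sampling ratio follows from simple compute matching: generation cost per token scales roughly linearly with parameter count, so a 27B and a 9B model differ by about a factor of three in how many samples a fixed FLOP budget buys. A minimal sketch of that arithmetic follows; the per-question sample counts in the example calls are illustrative, not taken from the paper.

```python
def compute_matched_samples(params_se_b: float, params_wc_b: float,
                            samples_per_question_se: int) -> int:
    """Samples per question the WC model can afford under the same FLOP budget,
    assuming generation cost per token scales linearly with parameter count
    and that solution lengths are comparable across the two models."""
    ratio = params_se_b / params_wc_b
    return int(samples_per_question_se * ratio)

# Gemma2-27B (SE) vs Gemma2-9B (WC): the budget that buys 1 SE sample per
# question buys ~3 WC samples, and 10 SE samples correspond to ~30 WC samples.
print(compute_matched_samples(27, 9, samples_per_question_se=1))   # -> 3
print(compute_matched_samples(27, 9, samples_per_question_se=10))  # -> 30
```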
Significant improvements in LLM performance were observed across benchmarks. Models fine-tuned on WC-generated data consistently outperformed those trained on SE-generated data. For example, WC-generated data yielded a 6% accuracy improvement in the knowledge distillation setup and a 5.8% improvement in the weak-to-strong improvement setup on the MATH dataset. Similar gains appeared across other datasets and training paradigms, indicating that WC models are effective at producing diverse and comprehensive training data. Despite the higher false positive rate, the broader range of correct solutions and greater problem coverage offered by WC models resulted in stronger fine-tuned models. This suggests that, under a fixed compute budget, WC models can enable more efficient training, challenging the conventional preference for SE models.
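To make the fine-tuning setups concrete, they can be read as pairings of a data-generating model and a student model, following the article's own definitions of distillation, self-improvement, and weak-to-strong improvement. The mapping below is an illustrative summary, not code or terminology taken verbatim from the paper.

```python
# Each setup pairs a data-generating model with the student fine-tuned on its
# (filtered) synthetic solutions. Labels are illustrative only.
finetuning_setups = {
    "knowledge_distillation":     {"data_generator": "SE (Gemma2-27B)", "student": "WC (Gemma2-9B)"},
    "self_improvement":           {"data_generator": "WC (Gemma2-9B)",  "student": "WC (Gemma2-9B)"},
    "weak_to_strong_improvement": {"data_generator": "WC (Gemma2-9B)",  "student": "SE (Gemma2-27B)"},
}

for name, cfg in finetuning_setups.items():
    print(f"{name}: train {cfg['student']} on data from {cfg['data_generator']}")
```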
Using WC models for synthetic data generation proves to be more compute-efficient than relying on SE models. By generating more diverse and comprehensive training data within a fixed compute budget, WC models enable the training of stronger LLM reasoners. These findings challenge the conventional wisdom in AI research, demonstrating that smaller, weaker models, when used optimally, can outperform stronger models in certain contexts. This approach has significant implications for the future of AI research, suggesting new pathways for training LLMs more efficiently as the performance gap between small and large models continues to narrow.
Check out the Paper. All credit for this research goes to the researchers of this project.