
    This AI Paper from CMU and Google DeepMind Studies the Role of Synthetic Data for Improving Math Reasoning Capabilities of LLMs

    June 30, 2024

    Large language models (LLMs) face a critical challenge in their training process: the impending scarcity of high-quality internet data. Predictions suggest that by 2026, the available pool of such data will be exhausted, forcing researchers to turn to model-generated or synthetic data for training. This shift presents both opportunities and risks. While some studies have shown that scaling up synthetic data can improve performance on complex reasoning tasks, others have revealed a concerning trend. Training on synthetic data can potentially lead to a downward spiral in model performance, amplifying biases, propagating misinformation, and reinforcing undesired stylistic properties. The core challenge lies in designing synthetic data that effectively addresses data scarcity without compromising the quality and integrity of the resulting models. This task is particularly daunting due to the current lack of understanding regarding how synthetic data influences LLM behavior.

    Researchers have explored various approaches to tackle LLM training challenges using synthetic data. Standard methods like teacher-forcing on expert data have shown limitations, particularly in math reasoning. Efforts to generate positive synthetic data aim to mimic high-quality training data, using sources like stronger teacher models and self-generated content. While this approach has shown promise, challenges persist in verifying the quality of synthetic math data. Concerns about bias amplification, model collapse, and overfitting on spurious steps remain. To mitigate these issues, researchers are investigating the use of negative model-generated responses to identify and unlearn problematic patterns in training data.

Researchers from Carnegie Mellon University, Google DeepMind, and MultiOn present a study investigating the impact of synthetic data on LLM math reasoning capabilities. The study examines both positive and negative synthetic data, finding that positive data improves performance, though with slower scaling rates than pretraining. Notably, self-generated positive responses often match the effectiveness of twice the amount of data from larger models. The researchers introduce a robust approach using negative synthetic data, contrasting it with positive data at critical steps. This technique, equivalent to per-step advantage-weighted reinforcement learning, can improve data efficiency by up to eight times compared to using only positive data. The study develops scaling laws for both data types on common reasoning benchmarks, offering valuable insights into optimizing synthetic data use for enhancing LLM performance in math reasoning tasks.

The detailed architecture of the proposed method involves several key components:

Synthetic Data Pipeline:
• Prompts capable models like GPT-4 and Gemini 1.5 Pro to generate new problems similar to real ones.
• Obtains solution traces with step-by-step reasoning for these problems.
• Implements a binary reward function to verify the correctness of solution traces.

Dataset Construction:
• Creates a positive synthetic dataset from correct problem–solution pairs.
• Generates positive and negative datasets using model-generated solutions.
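The pipeline and dataset-construction steps above can be sketched as follows. This is a minimal illustration: `binary_reward` here only checks the final answer line of a trace, standing in for whatever verifier the paper's binary reward function actually uses.

```python
def binary_reward(trace, answer):
    """Return 1 if the trace's final line matches the reference answer, else 0."""
    final_line = trace.strip().splitlines()[-1]
    return 1 if final_line == f"Answer: {answer}" else 0

def build_datasets(samples):
    """Split model-generated (problem, trace, answer) triples into
    positive (correct) and negative (incorrect) synthetic datasets."""
    positive, negative = [], []
    for problem, trace, answer in samples:
        if binary_reward(trace, answer):
            positive.append((problem, trace))
        else:
            negative.append((problem, trace))
    return positive, negative
```

In this framing, both datasets come out of the same generation pass; the negative set is a byproduct of verification rather than a separate collection effort.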

Learning Algorithms:
• Supervised Finetuning (SFT): trains on 𝒟syn using next-token prediction.
• Rejection Finetuning (RFT): uses the SFT policy to generate positive responses for 𝒟syn problems, then applies the next-token prediction loss on these self-generated positive responses.
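The RFT data-collection step can be sketched as below. `sample_fn` and `reward_fn` are hypothetical stand-ins for sampling from the SFT policy and for the binary correctness check; the paper's exact sampling budget and verifier are not specified here.

```python
def rejection_finetuning_data(problems, sample_fn, reward_fn, k=4):
    """Collect self-generated positive data for RFT: sample k solution traces
    per problem from the SFT policy and keep only the correct ones. The kept
    pairs are then trained on with the ordinary next-token prediction loss."""
    kept = []
    for problem in problems:
        for _ in range(k):
            trace = sample_fn(problem)        # sample from the SFT policy
            if reward_fn(problem, trace):     # binary reward: is the trace correct?
                kept.append((problem, trace))
    return kept
```

The key design point is that the positive data now comes from the model's own distribution rather than a stronger teacher, which is what the study credits for the roughly 2x efficiency gain.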

Preference Optimization:
• Utilizes Direct Preference Optimization (DPO) to learn from both positive and negative data.
• Implements two variants: standard DPO and per-step DPO.
• Per-step DPO identifies the “first pit” (the first incorrect step) in a solution trace to focus the objective on critical steps.

This architecture allows for comprehensive analysis of different synthetic data types and learning approaches, enabling the study of their impact on LLM math reasoning capabilities.
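The contrast between standard and per-step DPO can be illustrated with a toy version of the DPO loss plus a “first pit” locator. The log-probabilities below are placeholders for model outputs, and β and the step-level handling are illustrative assumptions, not the paper's exact formulation.

```python
import math

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one (correct, incorrect) pair of solution traces:
    -log sigmoid(beta * (policy-vs-reference log-ratio of chosen minus rejected))."""
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

def first_pit(step_correct):
    """Index of the 'first pit': the first incorrect step in a trace, or None
    if every step is correct."""
    for i, ok in enumerate(step_correct):
        if not ok:
            return i
    return None
```

Per-step DPO would then restrict the contrastive loss to the tokens from the first-pit step onward, rather than spreading it over the whole trace; this is what lets the method target the critical step where a solution goes wrong.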

    The study reveals significant insights into synthetic data scaling for LLM math reasoning. Positive data scaling shows improvement but with slower rates than pre-training. Surprisingly, self-generated positive data (RFT) outperforms data from more capable models, doubling efficiency. The most striking result comes from strategically using negative data with per-step Direct Preference Optimization, which increases data efficiency by 8x compared to positive data alone. This approach consistently outperforms other methods, highlighting the critical importance of carefully constructing and utilizing both positive and negative synthetic data in LLM training for mathematical reasoning tasks.

    This study explores the impact of synthetic data on improving LLMs’ math reasoning capabilities. It reveals that traditional methods using positive solutions from advanced models show limited efficiency. Self-generated positive data from fine-tuned 7B models improves efficiency by 2x but can amplify reliance on spurious steps. Surprisingly, incorporating negative (incorrect) traces addresses these limitations. By using negative data to estimate step-wise advantages and applying reinforcement learning techniques, the research demonstrates an 8x improvement in synthetic data efficiency. This approach, utilizing preference optimization objectives, significantly enhances LLMs’ mathematical reasoning abilities by effectively balancing positive and negative synthetic data.

Check out the Paper. All credit for this research goes to the researchers of this project.


    The post This AI Paper from CMU and Google DeepMind Studies the Role of Synthetic Data for Improving Math Reasoning Capabilities of LLMs appeared first on MarkTechPost.

