
    LLMs Can Now Solve Challenging Math Problems with Minimal Data: Researchers from UC Berkeley and Ai2 Unveil a Fine-Tuning Recipe That Unlocks Mathematical Reasoning Across Difficulty Levels

    April 19, 2025

    Language models have made significant strides on reasoning tasks, and even small-scale supervised fine-tuning (SFT) approaches such as LIMO and s1 show remarkable improvements in mathematical problem-solving. However, fundamental questions remain: do these models genuinely generalise beyond their training data, or are they merely overfitting to test sets? It is still unclear which capabilities small-scale SFT enhances and which limitations persist. Despite impressive results on popular benchmarks, the specific strengths and weaknesses of these fine-tuned models are poorly understood, leaving a critical gap in knowledge about their true reasoning abilities and practical limitations.

    Various attempts have been made to understand the effects of reasoning-based supervised fine-tuning beyond simple benchmark scores. Researchers have questioned whether SFT merely improves performance on previously seen problem types or genuinely enables models to transfer problem-solving strategies to new contexts, such as applying coordinate-based techniques in geometry. Existing methods focus on factors like correctness, solution length, and response diversity, which initial studies suggest play significant roles in model improvement through SFT. However, these approaches lack the granularity needed to determine exactly which types of previously unsolvable questions become solvable after fine-tuning, and which problem categories remain resistant to improvement despite extensive training. The research community still struggles to establish whether observed improvements reflect deeper learning or simply memorisation of training trajectories, highlighting the need for more sophisticated analysis methods.

    Researchers from the University of California, Berkeley and the Allen Institute for AI propose a tiered analysis framework to investigate how supervised fine-tuning affects reasoning capabilities in language models. The approach uses the AIME24 dataset, chosen for its complexity and widespread use in reasoning research. The dataset exhibits a ladder-like structure: models that solve higher-tier questions typically also succeed on lower-tier ones. By categorising questions into four difficulty tiers, Easy, Medium, Hard, and Extremely Hard (Exh), the study systematically examines what is required to advance from one tier to the next. The analysis reveals that progression from Easy to Medium primarily requires adopting an R1 reasoning style with a long inference context, while Hard-level questions demand greater computational stability during deep exploration. Exh-level questions present a fundamentally different challenge, requiring unconventional problem-solving strategies that current models uniformly struggle with. The research also identifies four key insights: a performance gap between potential and stability in small-scale SFT models, minimal benefits from careful dataset curation, diminishing returns from scaling SFT datasets, and potential intelligence barriers that may not be overcome through SFT alone.

    The methodology applies this tiered analysis with the AIME24 dataset as the primary test benchmark. The choice stems from three key attributes: the dataset's hierarchical difficulty, which challenges even state-of-the-art models; its diverse coverage of mathematical domains; and its focus on high school mathematics, which isolates pure reasoning ability from domain-specific knowledge. Qwen2.5-32B-Instruct serves as the base model due to its widespread adoption and inherent cognitive behaviours, including verification, backtracking, and subgoal setting. The fine-tuning data consists of question-response pairs from the OpenR1-Math-220k dataset, specifically chain-of-thought (CoT) trajectories generated by DeepSeek-R1 for problems from NuminaMath1.5, with incorrect solutions filtered out. The training configuration mirrors prior studies: a learning rate of 1 × 10⁻⁵, weight decay of 1 × 10⁻⁴, a batch size of 32, and 5 epochs. Performance is evaluated with avg@n (the average pass rate over n attempts) and cov@n (coverage, i.e. whether at least one of n attempts succeeds), and questions are categorised into the four difficulty levels (Easy, Medium, Hard, and Extremely Hard) based on model performance patterns.
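The two metrics can be illustrated with a minimal sketch. It assumes that avg@n is the per-question pass rate averaged over all questions and that cov@n is the fraction of questions solved in at least one of n attempts. The function names and the toy results are hypothetical and are not taken from the paper or its code.

```python
from typing import Sequence


def avg_at_n(results: Sequence[Sequence[bool]]) -> float:
    """Average pass rate: per-question fraction of correct attempts,
    averaged over all questions (assumed definition of avg@n)."""
    per_question = [sum(attempts) / len(attempts) for attempts in results]
    return sum(per_question) / len(per_question)


def cov_at_n(results: Sequence[Sequence[bool]]) -> float:
    """Coverage: fraction of questions with at least one correct attempt
    (assumed definition of cov@n)."""
    return sum(any(attempts) for attempts in results) / len(results)


# Hypothetical toy data: 3 questions, 4 attempts each (True = correct).
toy_results = [
    [True, True, True, True],      # solved every time
    [True, False, False, False],   # solved once: counts for coverage
    [False, False, False, False],  # never solved
]

print(f"avg@4 = {avg_at_n(toy_results):.3f}")
print(f"cov@4 = {cov_at_n(toy_results):.3f}")
```

A question solved only occasionally raises coverage fully but raises the average only slightly, which is why a model can match another on cov@n while trailing it on avg@n.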

    The results show that progressing from Easy to Medium-level mathematical problem-solving requires minimal but specific conditions. The study systematically examined multiple training variables, including foundational knowledge across diverse mathematical categories, dataset size (100 to 1,000 examples per category), trajectory length (short, normal, or long), and trajectory style (DeepSeek-R1 versus Gemini-flash). Through comprehensive ablation studies, the researchers isolated the impact of each dimension on model performance, represented as P = f(C, N, L, S), where C is the category, N the number of trajectories, L the trajectory length, and S the trajectory style. The findings show that reaching ≥90% performance on Medium-level questions requires at least 500 normal or long R1-style trajectories, regardless of the specific mathematical category. Models consistently fail to meet the threshold when trained with fewer trajectories, shorter trajectories, or Gemini-style trajectories. This indicates that trajectory length and quantity are critical factors in developing mathematical reasoning capabilities, while the subject matter of the trajectories matters less than their structural characteristics.
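The ablation design P = f(C, N, L, S) can be pictured as a grid over the four dimensions. The sketch below enumerates such a grid and flags the configurations that satisfy the condition reported above (at least 500 normal or long R1-style trajectories, in any category). The category names and the specific dataset sizes are hypothetical placeholders; the article gives only the 100 to 1,000 range, and this is not the authors' code.

```python
from itertools import product
from typing import NamedTuple


class AblationConfig(NamedTuple):
    category: str          # C: mathematical category of the training data
    num_trajectories: int  # N: number of training trajectories
    length: str            # L: "short", "normal", or "long"
    style: str             # S: "r1" or "gemini-flash"


# Hypothetical values chosen only for illustration.
CATEGORIES = ["algebra", "geometry"]
SIZES = [100, 500, 1000]
LENGTHS = ["short", "normal", "long"]
STYLES = ["r1", "gemini-flash"]


def meets_reported_condition(cfg: AblationConfig) -> bool:
    """Encode the reported finding: >= 500 normal/long R1-style
    trajectories are needed, independent of category."""
    return (
        cfg.num_trajectories >= 500
        and cfg.length in ("normal", "long")
        and cfg.style == "r1"
    )


grid = [AblationConfig(*combo)
        for combo in product(CATEGORIES, SIZES, LENGTHS, STYLES)]
passing = [cfg for cfg in grid if meets_reported_condition(cfg)]

print(f"{len(passing)} of {len(grid)} configurations satisfy the condition")
for cfg in passing:
    print(cfg)
```

Note that the category never appears in the predicate, which mirrors the claim that subject matter is less important than the number, length, and style of the trajectories.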

    The research demonstrates that models trained with small-scale supervised fine-tuning can potentially solve as many questions as more sophisticated models like DeepSeek-R1, though significant challenges remain. The primary limitation is instability in mathematical reasoning rather than a lack of capability. Experimental results show that geometry-trained models can achieve a coverage score of 90, matching R1's performance when given multiple attempts, yet their overall accuracy lags by more than 20%. This gap stems primarily from instability in deep exploration and computational limitations during complex problem-solving. Increasing the SFT dataset size offers one path forward, but performance follows a logarithmic scaling trend with diminishing returns. Notably, the study challenges recent assertions about the importance of careful dataset curation: performance across mathematical categories stays within a narrow range of 55±4%, with only marginal differences between deliberately constructed similar datasets and randomly constructed ones. This suggests that the quantity and quality of reasoning trajectories matter more than subject-specific content for developing robust mathematical reasoning capabilities.
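One way to read the "logarithmic scaling trend" is sketched below, assuming accuracy as a function of SFT dataset size N is roughly affine in log N. The constants a and b are unspecified fitted parameters introduced here for illustration; the article reports no values for them.

```latex
\[
  \mathrm{acc}(N) \;\approx\; a + b \log N
  \qquad\Longrightarrow\qquad
  \mathrm{acc}(2N) - \mathrm{acc}(N) \;\approx\; b \log 2 .
\]
```

Under this assumption each doubling of the dataset adds the same fixed increment, so the gain per additional trajectory shrinks as N grows, which is the diminishing-returns pattern described above.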



    The post LLMs Can Now Solve Challenging Math Problems with Minimal Data: Researchers from UC Berkeley and Ai2 Unveil a Fine-Tuning Recipe That Unlocks Mathematical Reasoning Across Difficulty Levels appeared first on MarkTechPost.
