    OpenThoughts: A Scalable Supervised Fine-Tuning (SFT) Data Curation Pipeline for Reasoning Models

    June 14, 2025

    The Growing Complexity of Reasoning Data Curation

    Recent reasoning models, such as DeepSeek-R1 and o3, have shown outstanding performance in mathematics, coding, and science by using post-training techniques like supervised fine-tuning (SFT) and reinforcement learning (RL). However, the complete methodologies behind these frontier reasoning models are not public, which makes it difficult for researchers to build comparable models. While SFT data curation has become a powerful approach for developing strong reasoning capabilities, most existing efforts explore only a limited set of design choices, such as relying solely on human-written questions or a single teacher model. Moreover, exploring the large design space of techniques for generating question-answer pairs incurs high costs for teacher inference and model training.

    Reasoning traces provided by models such as Gemini, QwQ, and DeepSeek-R1 have enabled knowledge distillation techniques to train smaller reasoning models. Projects like OpenR1, OpenMathReasoning, and OpenCodeReasoning collect questions from public forums and competition sites, while Natural Reasoning utilizes pre-training corpora as seed data. Some efforts, such as S1 and LIMO, focus on manually curating small, high-quality datasets of challenging prompts. Other methods, such as DeepMath-103K and Nvidia Nemotron, introduce innovations across data sourcing, filtering, and scaling stages. RL methods, including AceReason and Skywork-OR1, have enhanced reasoning capabilities beyond traditional SFT methods.

    OpenThoughts: A Scalable Framework for SFT Dataset Development

    Researchers from Stanford University, the University of Washington, BespokeLabs.ai, Toyota Research Institute, UC Berkeley, and 12 additional organizations have proposed OpenThoughts, a new state-of-the-art (SOTA) open reasoning data recipe. OpenThoughts takes a progressive approach across three iterations: OpenThoughts-114K scales the Sky-T1 pipeline with automated verification, OpenThoughts2-1M increases data scale through augmented question diversity and synthetic generation strategies, and OpenThoughts3-1.2M incorporates findings from over 1,000 ablation experiments to arrive at a simple, scalable, and high-performing data curation pipeline. The resulting model, OpenThinker3-7B, achieves state-of-the-art performance among open-data models at the 7B scale.

    OpenThoughts3-1.2M is built by ablating each pipeline component independently while holding the other stages constant: each strategy generates 31,600 data points, and Qwen2.5-7B-Instruct is fine-tuned on each resulting dataset. The goal is to find the best dataset of question-response pairs for SFT reasoning. Evaluation spans eight reasoning benchmarks in mathematics (AIME24, AMC23, MATH500), coding (CodeElo, CodeForces, LiveCodeBench), and science (GPQA Diamond, JEEBench). The experimental design includes a rigorous decontamination process that removes samples highly similar to benchmark items, and it maintains a held-out benchmark set for generalization testing. Evalchemy serves as the primary evaluation tool, ensuring consistent evaluation protocols.
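
    The article does not say how similarity is measured during decontamination. As a minimal sketch, assuming word n-gram Jaccard overlap as the similarity measure, the step might look like the following. The function names, the trigram size, and the 0.5 threshold are illustrative assumptions, not the project's actual settings.

```python
def ngrams(text, n=3):
    """Return the set of lowercase word n-grams in text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; 0.0 if either set is empty."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def decontaminate(train_questions, benchmark_questions, threshold=0.5, n=3):
    """Keep only training questions whose n-gram overlap with every
    benchmark question stays below the threshold (assumed value)."""
    bench_grams = [ngrams(q, n) for q in benchmark_questions]
    kept = []
    for question in train_questions:
        grams = ngrams(question, n)
        if all(jaccard(grams, b) < threshold for b in bench_grams):
            kept.append(question)
    return kept
```

    A training question that nearly duplicates a benchmark item is dropped, while unrelated questions pass through unchanged. A production pipeline would likely also normalize punctuation and use an index rather than this quadratic comparison.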

    Evaluation Insights and Benchmark Performance

    Evaluation of the OpenThoughts pipeline yields insights at each stage: question sourcing, question mixing, question filtering, answer filtering, and teacher model selection. Question sourcing experiments show that CodeGolf and competitive coding questions achieve the highest performance on code tasks (average scores of 25.3-27.5), while LLM-generated and human-written questions excel in mathematics (scores of 58.8-58.5), and physics StackExchange questions together with chemistry textbook extractions perform best in science (scores of 43.2-45.3). Question mixing experiments show that combining many question sources degrades performance: the best mixes improve accuracy by about 5% over more diverse mixing strategies. For the teacher model, QwQ-32B outperforms DeepSeek-R1 in knowledge distillation, with an accuracy improvement of 1.9-2.6%.
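
    The mixing result suggests ranking question sources by their ablation score and drawing questions only from the best few, rather than pooling everything. The sketch below illustrates that idea under stated assumptions: the function names, the value of k, and the uniform sampling are hypothetical and are not taken from the OpenThoughts code.

```python
import random


def select_top_sources(avg_scores, k=2):
    """Return the k source names with the highest average benchmark score."""
    ranked = sorted(avg_scores, key=avg_scores.get, reverse=True)
    return ranked[:k]


def build_mix(questions_by_source, avg_scores, k, n_total, seed=0):
    """Sample up to n_total questions, drawing only from the top-k sources.

    Sampling is uniform over the pooled questions (an assumption made
    for this sketch) and seeded so the mix is reproducible.
    """
    rng = random.Random(seed)
    top = select_top_sources(avg_scores, k)
    pool = [q for source in top for q in questions_by_source[source]]
    return rng.sample(pool, min(n_total, len(pool)))
```

    Calling `build_mix` with a score table from a sourcing ablation and a small `k` yields a dataset restricted to the strongest sources. Setting `k` to the total number of sources recovers the fully diverse mix for comparison.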

    In conclusion, the researchers present the OpenThoughts project, showing that systematic experimentation can significantly advance SFT data curation for reasoning models. They developed OpenThoughts3-1.2M, a state-of-the-art open-data reasoning dataset spanning science, mathematics, and coding. The resulting OpenThinker3-7B model achieves the best performance among open-data reasoning models at its scale. However, several directions remain unexplored, including RL approaches, staged fine-tuning, and curriculum learning strategies. Future research directions include investigating cross-domain transfer effects when optimizing individual domains versus overall performance, and understanding the scaling dynamics as student models approach teacher capabilities.


    The post OpenThoughts: A Scalable Supervised Fine-Tuning SFT Data Curation Pipeline for Reasoning Models appeared first on MarkTechPost.
