    s1: A Simple Yet Powerful Test-Time Scaling Approach for LLMs

    February 6, 2025

    Language models (LMs) have improved largely by scaling up training compute, mainly through large-scale self-supervised pretraining. A newer paradigm, test-time scaling, instead improves performance by spending more computation at inference time. OpenAI’s o1 model validated this approach, showing stronger reasoning as test-time compute increases. Replicating these results has proven difficult, however, despite attempts using Monte Carlo Tree Search (MCTS), multi-agent approaches, and reinforcement learning. Even DeepSeek R1, which relied on millions of samples and multiple training stages, has not openly replicated the clear test-time scaling behavior seen in o1, and neither have the other attempts.

    Several families of methods address test-time scaling. Sequential scaling has the model generate successive solution attempts, with each attempt building on the previous ones. Parallel scaling samples several independent attempts and aggregates them, for example through majority voting or Best-of-N selection. Tree-based search methods such as MCTS and guided beam search combine the two. REBASE is a notable example: it uses a process reward model to balance exploitation and pruning during tree search, and it has been reported to outperform both sampling-based methods and MCTS. These approaches rely heavily on reward models, which come in two forms: outcome reward models, which score complete solutions for Best-of-N selection, and process reward models, which score individual reasoning steps in tree-based search.
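
As a rough illustration of the Best-of-N selection described above, the following Python sketch picks the candidate that an outcome reward model scores highest. The names `best_of_n` and `toy_reward` are hypothetical, and the keyword-matching reward is only a stand-in for a learned reward model.

```python
from typing import Callable, Sequence


def best_of_n(candidates: Sequence[str],
              outcome_reward: Callable[[str], float]) -> str:
    """Return the complete solution with the highest outcome reward."""
    if not candidates:
        raise ValueError("need at least one candidate solution")
    # max() keeps the first candidate in case of a tie.
    return max(candidates, key=outcome_reward)


def toy_reward(solution: str) -> float:
    """Hypothetical stand-in for an outcome reward model.

    A real reward model would be a trained scorer; this toy version
    simply rewards solutions that end with the known answer "42".
    """
    return 1.0 if solution.strip().endswith("42") else 0.0


samples = [
    "The answer is 41",
    "So the result is 42",
    "I think it is 40",
]
best = best_of_n(samples, toy_reward)
print(best)
```

A process reward model would be used differently: it scores each intermediate step, so a search procedure can prune weak branches before a full solution exists.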

    Researchers from Stanford University, the University of Washington, the Allen Institute for AI, and Contextual AI have proposed a streamlined approach to test-time scaling and stronger reasoning. The method rests on two contributions. The first is s1K, a curated dataset of 1,000 questions paired with reasoning traces, selected for difficulty, diversity, and quality. The second is a technique called budget forcing, which controls test-time computation in both directions: it cuts the model’s thinking short once a budget is reached, or extends it by inserting “Wait” when the model tries to stop, which prompts the model to review and correct its reasoning. The approach was implemented by fine-tuning the Qwen2.5-32B-Instruct language model on the s1K dataset.
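
The budget-forcing idea can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the authors' implementation: the `next_token` callable, the `<end_think>` delimiter, and the `min_tokens`, `max_tokens`, and `max_waits` parameters are all hypothetical, and a scripted stub stands in for a real language model.

```python
from typing import Callable, Iterable, List

END_OF_THINKING = "<end_think>"  # hypothetical end-of-thinking delimiter


def budget_forced_thinking(next_token: Callable[[List[str]], str],
                           prompt: List[str],
                           min_tokens: int,
                           max_tokens: int,
                           max_waits: int = 2) -> List[str]:
    """Generate a thinking trace whose length is steered by a token budget."""
    thinking: List[str] = []
    waits_used = 0
    while len(thinking) < max_tokens:
        token = next_token(prompt + thinking)
        if token == END_OF_THINKING:
            # The model wants to stop. If it is under budget, suppress
            # the delimiter and append "Wait" to extend the reasoning.
            if len(thinking) < min_tokens and waits_used < max_waits:
                thinking.append("Wait")
                waits_used += 1
                continue
            break
        thinking.append(token)
    # Reached either because the model stopped or because the maximum
    # budget was hit; in both cases the thinking phase is closed here.
    thinking.append(END_OF_THINKING)
    return thinking


def make_scripted_model(script: Iterable[str]) -> Callable[[List[str]], str]:
    """Toy stand-in for a language model that replays a fixed script."""
    tokens = iter(script)
    return lambda context: next(tokens, END_OF_THINKING)


# Extension: the model tries to stop after a wrong answer and is nudged on.
extended = budget_forced_thinking(
    make_scripted_model(["2+2", "=", "5", END_OF_THINKING,
                         "so", "2+2", "=", "4", END_OF_THINKING]),
    prompt=["What", "is", "2+2?"], min_tokens=5, max_tokens=20)

# Truncation: the thinking phase is cut off after three tokens.
truncated = budget_forced_thinking(
    make_scripted_model(["a", "b", "c", "d", "e"]),
    prompt=[], min_tokens=0, max_tokens=3)
print(extended)
print(truncated)
```

In this sketch the inserted "Wait" counts toward the budget, and `max_waits` caps how often the model can be pushed to continue; both are design choices made for the example.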

    Data selection follows three filtering stages based on quality, difficulty, and diversity. Quality filtering removes samples with API errors and formatting issues, reducing the initial pool to 51,581 examples; 384 samples from sources judged to be high quality are selected at this point. Difficulty is assessed with two metrics: whether the Qwen2.5-7B-Instruct and Qwen2.5-32B-Instruct models answer the question correctly, with correctness judged by Claude 3.5 Sonnet, and the length of the reasoning trace as measured by the Qwen2.5 tokenizer. For diversity, Claude 3.5 Sonnet classifies each question into a domain using the Mathematics Subject Classification system. The result is a final dataset of 1,000 samples spanning 50 domains.
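
The three stages can be pictured as a small pipeline. The Python sketch below is illustrative only: the `Sample` fields and function names are hypothetical, the difficulty rule (drop a question if either model solves it) and the "longest trace first" sampling within a domain are simplifying assumptions, and the toy records are made up for the example.

```python
import random
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Sample:
    question: str
    trace_tokens: int      # reasoning trace length in tokens
    domain: str            # subject classification label
    has_api_error: bool
    badly_formatted: bool
    solved_by_7b: bool     # smaller model answered correctly
    solved_by_32b: bool    # larger model answered correctly


def quality_filter(samples: List[Sample]) -> List[Sample]:
    """Stage 1: drop samples with API errors or formatting issues."""
    return [s for s in samples
            if not (s.has_api_error or s.badly_formatted)]


def difficulty_filter(samples: List[Sample]) -> List[Sample]:
    """Stage 2 (assumed rule): drop questions either model solves."""
    return [s for s in samples
            if not (s.solved_by_7b or s.solved_by_32b)]


def diversity_sample(samples: List[Sample], k: int,
                     rng: random.Random) -> List[Sample]:
    """Stage 3 (simplified): pick a random domain, then its longest trace."""
    by_domain: Dict[str, List[Sample]] = defaultdict(list)
    for s in samples:
        by_domain[s.domain].append(s)
    for pool in by_domain.values():
        pool.sort(key=lambda s: s.trace_tokens)  # longest trace is last
    chosen: List[Sample] = []
    while len(chosen) < k and by_domain:
        domain = rng.choice(sorted(by_domain))
        chosen.append(by_domain[domain].pop())
        if not by_domain[domain]:
            del by_domain[domain]
    return chosen


pool = [
    Sample("q1", 900, "algebra", False, False, False, False),
    Sample("q2", 300, "algebra", False, False, True, False),
    Sample("q3", 1200, "geometry", False, False, False, False),
    Sample("q4", 500, "geometry", True, False, False, False),
    Sample("q5", 700, "probability", False, True, False, False),
    Sample("q6", 400, "algebra", False, False, False, False),
]
clean = quality_filter(pool)
hard = difficulty_filter(clean)
selected = diversity_sample(hard, k=3, rng=random.Random(0))
print([s.question for s in selected])
```

Sampling a domain first, rather than sampling questions directly, keeps small domains represented even when one domain dominates the pool.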

    The s1-32B model shows clear performance gains as test-time compute is scaled with budget forcing. Sequential scaling of s1-32B with budget forcing scales better than parallel scaling of the base Qwen2.5-32B-Instruct model with majority voting, which supports sequential over parallel approaches in this setting. s1-32B is also the most sample-efficient open-data reasoning model in the comparison, improving markedly over the base model with just 1,000 additional training samples. While r1-32B achieves better performance, it requires 800 times more training data. Notably, s1-32B approaches Gemini 2.0 Thinking’s performance on AIME24, suggesting successful knowledge distillation.
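
For reference, the majority-voting baseline mentioned above can be sketched as follows. This is a generic illustration of the technique, with a hypothetical `majority_vote` helper and made-up sampled answers, not the evaluation code used in the paper.

```python
from collections import Counter
from typing import Sequence


def majority_vote(answers: Sequence[str]) -> str:
    """Return the most frequent final answer among independent samples."""
    if not answers:
        raise ValueError("need at least one sampled answer")
    # most_common(1) returns [(answer, count)]; ties go to the answer
    # that appeared first among the samples.
    return Counter(answers).most_common(1)[0][0]


# Five independently sampled final answers to the same question.
sampled_answers = ["113", "113", "112", "7", "113"]
voted = majority_vote(sampled_answers)
print(voted)
```

Parallel scaling of this kind spends extra compute on more samples, whereas budget forcing spends it on a longer single reasoning trace.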

    This paper shows that Supervised Fine-Tuning (SFT) on just 1,000 carefully selected examples can produce a reasoning model that is competitive with o1-preview while remaining highly sample-efficient. Combined with this model, budget forcing reproduces the test-time scaling behavior shown by OpenAI’s o1. That so little training data suffices suggests that the model’s reasoning capabilities are largely acquired during pretraining on trillions of tokens, with fine-tuning merely activating these latent abilities. This is consistent with the “Superficial Alignment Hypothesis” from the LIMA research, which holds that a relatively small number of examples can align a model’s behavior with desired outcomes.


    Check out the Paper and GitHub Page. All credit for this research goes to the researchers of this project.

    The post s1: A Simple Yet Powerful Test-Time Scaling Approach for LLMs appeared first on MarkTechPost.
