    Open-Reasoner-Zero: An Open-source Implementation of Large-Scale Reasoning-Oriented Reinforcement Learning Training

    February 25, 2025

Large-scale reinforcement learning (RL) training of language models on reasoning tasks has become a promising technique for mastering complex problem-solving skills. Methods such as OpenAI’s o1 and DeepSeek’s R1-Zero have demonstrated a remarkable train-time scaling phenomenon: both models’ benchmark performance and response length increase consistently and steadily, with no sign of saturation, as training computation scales up. Inspired by these advances, the researchers behind this paper explore this new scaling phenomenon by conducting large-scale RL training directly on base models, an approach they call Reasoner-Zero training.

Researchers from StepFun and Tsinghua University have proposed Open-Reasoner-Zero (ORZ), an open-source implementation of large-scale reasoning-oriented RL training for language models, and a significant step toward making these training techniques accessible to the broader research community. ORZ enhances diverse reasoning skills under verifiable rewards, spanning arithmetic, logic, coding, and common-sense reasoning tasks, and it addresses critical challenges in training stability, response-length optimization, and benchmark performance through a comprehensive training strategy. Unlike previous approaches that provided limited implementation details, ORZ offers detailed insight into its methodology and best practices.
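
To make the idea of "verifiable rewards" concrete, below is a minimal Python sketch of a rule-based reward check. The \boxed{...} answer convention and the exact-match rule are illustrative assumptions, not details confirmed by the paper:

```python
import re

def rule_based_reward(response: str, reference_answer: str) -> float:
    """Hypothetical verifiable reward: 1.0 if the model's final boxed
    answer matches the reference exactly, 0.0 otherwise. The \\boxed{...}
    convention and exact string matching are illustrative assumptions."""
    match = re.search(r"\\boxed\{([^}]*)\}", response)
    if match is None:
        return 0.0  # no parseable final answer
    return 1.0 if match.group(1).strip() == reference_answer.strip() else 0.0

# Example: a correct arithmetic answer earns reward 1.0
print(rule_based_reward(r"Thinking step by step... \boxed{408}", "408"))
```

Because a check like this is deterministic and cheap, it scales to very large rollout volumes without requiring a learned reward model.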

The ORZ framework uses Qwen2.5-{7B, 32B} as base models and performs direct large-scale RL training without a preliminary fine-tuning step. The system leverages a scaled-up version of the standard PPO algorithm, optimized specifically for reasoning-oriented tasks. The training dataset consists of carefully curated question-answer pairs covering STEM, math, and diverse reasoning tasks, and the architecture incorporates a specialized prompt template designed to enhance inference-time computation. The implementation is built on OpenRLHF, with significant improvements including a flexible trainer, GPU-collocated generation, and advanced offload-backload support for efficient large-scale training.
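
The paper's exact prompt template is not reproduced here, but the sketch below shows the general shape of a reasoning-oriented template applied to question-answer pairs; the wording is hypothetical:

```python
# Hypothetical reasoning prompt template; the exact wording ORZ uses is
# not reproduced here, only the general shape of such a template.
TEMPLATE = (
    "A conversation between User and Assistant. The Assistant first reasons "
    "through the problem step by step and then states the final answer.\n"
    "User: {question}\n"
    "Assistant:"
)

def build_prompt(question: str) -> str:
    """Fill the template with a question from the curated QA dataset."""
    return TEMPLATE.format(question=question)

print(build_prompt("What is 17 * 24?"))
```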

The training results demonstrate significant performance improvements across multiple metrics for both the 7B and 32B variants of Open-Reasoner-Zero. Training curves reveal consistent gains in reward and response length, with a notable “step moment” phenomenon indicating sudden improvements in reasoning capability. In a response-length scale-up comparison against DeepSeek-R1-Zero (a 671B MoE model), Open-Reasoner-Zero-32B reaches comparable response lengths with only 1/5.8 of the training steps, validating the effectiveness of the minimalist approach to large-scale RL training.

The main experimental results show that Open-Reasoner-Zero performs exceptionally well across multiple evaluation benchmarks, particularly in the 32B configuration. It achieves superior results to DeepSeek-R1-Zero-Qwen2.5-32B on the GPQA Diamond benchmark while requiring only 1/30 of the training steps, showcasing remarkable training efficiency. The 7B variant exhibits interesting learning dynamics, with steady accuracy improvements and dramatic response-length growth. A distinctive “step moment” phenomenon is also observed during evaluation, characterized by sudden increases in both reward and response length, particularly evident on the GPQA Diamond and AIME2024 benchmarks.

In this paper, the researchers introduced Open-Reasoner-Zero, a significant milestone in democratizing large-scale reasoning-oriented RL training for language models. The research shows that a simplified approach, vanilla PPO with GAE and rule-based reward functions, can achieve results competitive with more complex systems. The successful implementation without KL regularization suggests that complex architectural modifications may not be necessary for achieving strong reasoning capabilities. By open-sourcing the complete training pipeline and sharing detailed insights, this work establishes a foundation for future research into scaling the reasoning abilities of language models, and it may mark the beginning of a new scaling trend in AI development.
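
To make the "vanilla PPO with GAE, no KL regularization" recipe concrete, here is a minimal NumPy sketch of the two core computations: GAE advantages over one rollout, and the clipped PPO surrogate loss with no KL penalty term. Hyperparameter values are illustrative assumptions, not the paper's settings:

```python
import numpy as np

def gae_advantages(rewards, values, gamma=1.0, lam=0.95):
    """Generalized Advantage Estimation over one trajectory.
    rewards: shape (T,); values: V(s_0..s_T), shape (T+1,), where
    values[T] bootstraps the final state (0.0 if terminal).
    gamma and lam are illustrative defaults, not the paper's settings."""
    T = len(rewards)
    advantages = np.zeros(T)
    gae = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]  # TD error
        gae = delta + gamma * lam * gae  # exponentially weighted sum
        advantages[t] = gae
    returns = advantages + values[:-1]  # regression targets for the critic
    return advantages, returns

def ppo_policy_loss(logp_new, logp_old, advantages, clip_eps=0.2):
    """Clipped PPO surrogate. Note the absence of a KL penalty term,
    consistent with the paper's finding that KL regularization can be
    dropped without hurting training stability."""
    ratio = np.exp(logp_new - logp_old)
    clipped = np.clip(ratio, 1 - clip_eps, 1 + clip_eps)
    return -np.mean(np.minimum(ratio * advantages, clipped * advantages))
```

Minimizing this loss increases the probability of actions with positive advantage while the clipping keeps each policy update small, which is the entire stabilization mechanism once the KL term is removed.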


    Check out the Paper and GitHub Page. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter and don’t forget to join our 80k+ ML SubReddit.


    The post Open-Reasoner-Zero: An Open-source Implementation of Large-Scale Reasoning-Oriented Reinforcement Learning Training appeared first on MarkTechPost.
