    Shanghai AI Lab Releases OREAL-7B and OREAL-32B: Advancing Mathematical Reasoning with Outcome Reward-Based Reinforcement Learning

    February 11, 2025

    Mathematical reasoning remains a difficult area for artificial intelligence (AI) due to the complexity of problem-solving and the need for structured, logical thinking. While large language models (LLMs) have made significant progress, they often struggle with tasks that require multi-step reasoning. Reinforcement learning (RL) has shown promise in improving these capabilities, yet traditional methods face challenges when rewards are sparse and binary, providing little feedback beyond a correct or incorrect answer.

    Shanghai AI Laboratory has developed Outcome REwArd-based reinforcement Learning (OREAL), a reinforcement learning framework for mathematical reasoning, and has released two models trained with it: OREAL-7B and OREAL-32B. The framework is designed for situations where only binary outcome rewards (correct or incorrect) are available. Unlike conventional RL approaches that rely on dense feedback, OREAL uses Best-of-N (BoN) sampling for behavior cloning and reshapes negative rewards to maintain gradient consistency.
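    To make the binary-reward setting concrete, here is a minimal Python sketch of how N sampled solutions might be split into correct and incorrect groups, with the correct ones kept as behavior-cloning targets. It is an illustration, not the OREAL implementation: the `Candidate` type, the exact-match `binary_reward` check, and `split_by_outcome` are all hypothetical names and simplifications introduced here.

    ```python
    from typing import List, Tuple

    # Hypothetical container: (reasoning_trajectory, final_answer).
    Candidate = Tuple[str, str]


    def binary_reward(final_answer: str, reference: str) -> int:
        """Return 1 if the final answer matches the reference exactly, else 0.

        Assumption: exact string match stands in for a real answer verifier.
        """
        return 1 if final_answer.strip() == reference.strip() else 0


    def split_by_outcome(
        candidates: List[Candidate], reference: str
    ) -> Tuple[List[Candidate], List[Candidate]]:
        """Split N sampled candidates into (positives, negatives) by outcome reward."""
        positives: List[Candidate] = []
        negatives: List[Candidate] = []
        for cand in candidates:
            if binary_reward(cand[1], reference) == 1:
                positives.append(cand)
            else:
                negatives.append(cand)
        return positives, negatives


    # Toy data: three sampled solutions to "2 + 2".
    samples = [
        ("2 + 2 is 4", "4"),
        ("2 + 2 is 5", "5"),
        ("two plus two equals 4", " 4 "),
    ]
    positives, negatives = split_by_outcome(samples, "4")
    ```

    In this sketch the positives would serve as imitation targets, while the negatives are kept rather than discarded, since the framework described above also learns from incorrect samples through reshaped rewards.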

    OREAL-7B and OREAL-32B demonstrate that smaller models can perform competitively with significantly larger models. OREAL-7B achieves a 94.0% pass@1 score on the MATH-500 benchmark, a result comparable to previous 32B models, while OREAL-32B reaches 95.0% pass@1, surpassing previous models trained through distillation.
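    For readers unfamiliar with the metric, the following Python sketch shows one common way to compute pass@1: the fraction of sampled answers that are correct, averaged over problems. The function name `pass_at_1` and the input layout are assumptions made for illustration, and the evaluation protocol actually used for the reported scores may differ. MATH-500 contains 500 problems, so with a single attempt per problem a 94.0% score corresponds to 470 correct answers.

    ```python
    from typing import List


    def pass_at_1(results: List[List[bool]]) -> float:
        """Average per-problem accuracy.

        results[i] holds the correctness of each sampled answer for problem i.
        With one sample per problem this is plain accuracy.
        """
        if not results or any(len(r) == 0 for r in results):
            raise ValueError("every problem needs at least one sampled answer")
        per_problem = [sum(r) / len(r) for r in results]
        return sum(per_problem) / len(per_problem)


    # Illustrative data only: 470 solved and 30 unsolved problems out of 500.
    single_attempt = [[True]] * 470 + [[False]] * 30
    score = pass_at_1(single_attempt)
    ```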

    Technical Insights and Advantages

    The OREAL framework introduces several key techniques to improve mathematical reasoning:

    1. Best-of-N Sampling for Behavior Cloning: BoN sampling helps select optimal positive reasoning trajectories, allowing the model to learn from well-formed solutions.
    2. Reward Reshaping for Negative Samples: By adjusting negative rewards, the framework ensures gradient consistency between correct and incorrect samples, refining model optimization.
    3. Token-Level Reward Model for Chain-of-Thought Reasoning: Mathematical reasoning often involves long sequences of logical steps. OREAL assigns importance weights to key reasoning tokens, addressing the challenge of sparse binary feedback.
    4. On-Policy Reinforcement Learning: The model dynamically refines itself based on sampled queries, improving training efficiency and adaptability.

    These techniques enable more stable training and better performance in long-sequence reasoning tasks, making reinforcement learning a viable alternative to traditional distillation approaches.
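    The idea of weighting individual tokens under a single sequence-level outcome can be sketched in a few lines of Python. This is a simplified policy-gradient-style surrogate written for illustration, not the objective from the OREAL paper: the function `weighted_sequence_loss`, the `negative_scale` parameter, and the normalization by total weight are all assumptions.

    ```python
    from typing import Sequence


    def weighted_sequence_loss(
        token_logprobs: Sequence[float],
        token_weights: Sequence[float],
        correct: bool,
        negative_scale: float = 0.5,
    ) -> float:
        """Toy surrogate loss for one trajectory under a binary outcome reward.

        token_logprobs: log-probabilities the policy assigned to each token.
        token_weights:  non-negative importance weights, one per token
                        (hypothetically produced by a token-level reward model).
        correct:        the binary outcome for the whole trajectory.
        negative_scale: assumed rescaling of the signal for incorrect samples.
        """
        if len(token_logprobs) != len(token_weights):
            raise ValueError("need one weight per token")
        total_weight = sum(token_weights)
        if total_weight <= 0:
            raise ValueError("weights must sum to a positive value")

        # Importance-weighted average log-probability of the trajectory.
        weighted_logprob = (
            sum(w * lp for w, lp in zip(token_weights, token_logprobs)) / total_weight
        )

        # Correct samples are imitated; incorrect ones are pushed down,
        # with a rescaled magnitude.
        advantage = 1.0 if correct else -negative_scale
        return -advantage * weighted_logprob


    logprobs = [-0.1, -0.2, -0.3]
    weights = [1.0, 1.0, 2.0]  # the last token is treated as most important
    loss_pos = weighted_sequence_loss(logprobs, weights, correct=True)
    loss_neg = weighted_sequence_loss(logprobs, weights, correct=False)
    ```

    Minimizing `loss_pos` raises the probability of the highly weighted tokens in a correct solution, while minimizing `loss_neg` lowers it for an incorrect one, so tokens with larger weights receive more of the sparse outcome signal.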

    Performance and Evaluation

    OREAL models have been tested across several benchmarks:

    • MATH-500 Benchmark:
      • OREAL-7B achieves 94.0% pass@1, a performance level previously seen only in 32B models.
      • OREAL-32B achieves 95.0% pass@1, surpassing previous 32B models trained through distillation.
    • AIME2024 and OlympiadBench:
      • OREAL models outperform multiple baselines, showing strong generalization across problem types.
    • Comparison with OpenAI o-series and DeepSeek Models:
      • OREAL-32B surpasses DeepSeek-R1-Distill-Qwen-32B and OpenAI-o1-preview, demonstrating effective training strategies.
      • OREAL-7B achieves results on par with QwQ-32B-Preview and OpenAI-o1-mini, highlighting the impact of its reinforcement learning approach.

    Conclusion

    Shanghai AI Lab’s OREAL-7B and OREAL-32B models offer a refined approach to reinforcement learning in mathematical reasoning. By addressing the challenge of sparse binary rewards through Best-of-N sampling, reward reshaping for negative samples, and token-level importance weighting, these models achieve competitive performance even at smaller scales. The OREAL framework shows how reinforcement learning can be adapted to complex reasoning tasks when only outcome-level feedback is available, suggesting new directions for improving AI’s problem-solving capabilities in structured domains.


    Check out the Paper, OREAL-7B and OREAL-32B. All credit for this research goes to the researchers of this project.


    The post Shanghai AI Lab Releases OREAL-7B and OREAL-32B: Advancing Mathematical Reasoning with Outcome Reward-Based Reinforcement Learning appeared first on MarkTechPost.
