
    Meet OREO (Offline REasoning Optimization): An Offline Reinforcement Learning Method for Enhancing LLM Multi-Step Reasoning

    December 24, 2024

    Large Language Models (LLMs) have demonstrated impressive proficiency in numerous tasks, but their ability to perform multi-step reasoning remains a significant challenge. This limitation becomes particularly evident in complex scenarios such as mathematical problem-solving, embodied agent control, and web navigation. Traditional Reinforcement Learning (RL) methods, like Proximal Policy Optimization (PPO), have been applied to address this issue but often come with high computational and data costs, making them less practical. Methods such as Direct Preference Optimization (DPO), while effective for aligning models with human preferences, likewise struggle with multi-step reasoning: DPO’s reliance on pairwise preference data and its uniform treatment of every token in a response undermine its capacity to assign credit effectively when rewards are sparse. These obstacles highlight the need for more targeted and efficient solutions to enhance LLM reasoning capabilities.
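
    For reference, the standard DPO objective (as introduced in the original DPO paper, not specific to this work) makes that credit-assignment criticism concrete. Given a prompt $x$ with a preferred response $y_w$ and a dispreferred response $y_l$:

    $$\mathcal{L}_{\mathrm{DPO}}(\pi_\theta) = -\,\mathbb{E}_{(x,\, y_w,\, y_l)}\left[ \log \sigma\!\left( \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)} \right) \right]$$

    The log-ratios are computed over complete responses, so every token in a multi-step solution receives the same sequence-level learning signal; when only a few steps determine success, that signal cannot single them out.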

    Introducing OREO: Offline Reasoning Optimization

    OREO (Offline REasoning Optimization) is an offline RL approach specifically designed to address the shortcomings of existing methods in improving multi-step reasoning for LLMs. Developed collaboratively by researchers from UC San Diego, Tsinghua University, Salesforce Research, and Northwestern University, OREO builds on insights from maximum entropy reinforcement learning. It trains a policy model and a value function concurrently by optimizing the soft Bellman Equation. This methodology removes the dependency on pairwise preference data, making it possible to utilize unpaired datasets with sparse rewards. Furthermore, OREO enables precise credit assignment across reasoning trajectories, which is especially beneficial when success depends on a few critical steps. The framework can also be extended to iterative exploration setups and incorporates a learned value function to enhance inference through tree search during testing.
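
    To make “optimizing the soft Bellman Equation” concrete, here is the standard KL-regularized (maximum-entropy) form it refers to; the paper’s exact parametrization may differ. Treat the state $s_t$ as the prompt plus the reasoning steps generated so far and the action $a_t$ as the next step, with a reference policy $\pi_{\mathrm{ref}}$ and temperature $\beta$:

    $$Q(s_t, a_t) = r(s_t, a_t) + V(s_{t+1}), \qquad V(s_t) = \beta \log \sum_{a} \pi_{\mathrm{ref}}(a \mid s_t)\, e^{Q(s_t, a)/\beta}$$

    Since the optimal policy satisfies $\beta \log\big(\pi^{*}(a_t \mid s_t)/\pi_{\mathrm{ref}}(a_t \mid s_t)\big) = Q(s_t, a_t) - V(s_t)$, substituting it into the Bellman equation yields a consistency condition that holds on any single trajectory, paired or not. Penalizing its squared violation, for example

    $$\Big( V_\phi(s_t) + \beta \log \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\mathrm{ref}}(a_t \mid s_t)} - r(s_t, a_t) - V_\phi(s_{t+1}) \Big)^{2},$$

    is one way to train the policy $\pi_\theta$ and the value function $V_\phi$ jointly from unpaired offline data with sparse rewards, which is the property attributed to OREO above.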

    Technical Details and Benefits

    OREO’s core innovation lies in optimizing the soft Bellman Equation to simultaneously train policy and value models. This strategy ensures accurate credit assignment across reasoning steps, addressing the limitations of methods like DPO. Additionally, OREO offers step-level and response-level objectives, providing flexibility for different granularities of reasoning tasks. During test-time inference, the value function supports advanced search techniques, such as beam search, improving accuracy. Unlike baseline methods such as supervised fine-tuning (SFT) or rejection sampling, OREO excels at leveraging failed trajectories to enhance model robustness and adaptability. This capacity to learn from failures makes it particularly valuable for iterative multi-step reasoning tasks.
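
    Concretely, value-guided search at inference time can look like the following minimal Python sketch. This illustrates the idea described above rather than the paper’s implementation: `propose_steps` and `value_fn` are hypothetical stand-ins for sampling candidate reasoning steps from the policy LLM and scoring partial trajectories with the learned value function.

    from typing import Callable, List, Tuple

    def value_guided_beam_search(
        prompt: str,
        propose_steps: Callable[[str, int], List[str]],  # samples candidate next steps
        value_fn: Callable[[str], float],                # learned value of a partial trajectory
        beam_width: int = 4,
        expand_per_beam: int = 4,
        max_steps: int = 8,
    ) -> str:
        """Keep the `beam_width` partial reasoning trajectories that the
        value function scores highest, expanding each by `expand_per_beam`
        candidate steps per round."""
        beams: List[Tuple[float, str]] = [(value_fn(prompt), prompt)]
        for _ in range(max_steps):
            candidates: List[Tuple[float, str]] = []
            for _, trajectory in beams:
                for step in propose_steps(trajectory, expand_per_beam):
                    extended = trajectory + "\n" + step
                    candidates.append((value_fn(extended), extended))
            if not candidates:
                break
            candidates.sort(key=lambda scored: scored[0], reverse=True)
            beams = candidates[:beam_width]  # prune to the best partial trajectories
        return beams[0][1]  # the highest-value full trajectory

    # Toy usage with stub functions; a real setup would sample steps from the
    # policy LLM and score them with the trained value head.
    if __name__ == "__main__":
        steps = lambda traj, k: [f"step {len(traj)}.{i}" for i in range(k)]
        value = lambda traj: -abs(len(traj) - 60.0)  # toy: prefer ~60-char trajectories
        print(value_guided_beam_search("Q: 2 + 2 = ?", steps, value))

    The design point is that the learned value function, rather than token log-probability alone, decides which partial chains of reasoning survive pruning.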

    Results and Insights

    OREO’s performance has been rigorously evaluated on benchmarks such as GSM8K and MATH for mathematical reasoning, and ALFWorld for embodied agent control. Key findings include:

    • On GSM8K, OREO delivered a 5.2% relative improvement in accuracy with a 1.5B-parameter model compared to SFT, alongside a 10.5% relative improvement on MATH (see the note on relative figures after this list).
    • The same 1.5B model reached 52.5% accuracy on MATH without using an augmented problem set.
    • In ALFWorld, OREO achieved a 17.7% relative improvement in performance in unseen environments, underscoring its ability to generalize beyond training data.
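
    A note on reading these figures: the gains are relative improvements, not absolute percentage points. A relative gain $g$ over a baseline accuracy $a_{\mathrm{SFT}}$ means $a_{\mathrm{OREO}} = (1 + g)\, a_{\mathrm{SFT}}$; with purely illustrative numbers (not from the paper), a hypothetical 50.0% SFT baseline and $g = 0.052$ would give $50.0\% \times 1.052 = 52.6\%$.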

    Iterative training further amplified OREO’s effectiveness, showing consistent accuracy gains over multiple iterations. While approaches like rejection sampling exhibited diminishing returns, OREO continued to improve by incorporating insights from failed attempts. Test-time search using OREO’s value function resulted in up to a 17.9% relative improvement over greedy decoding on the MATH dataset, highlighting its impact on inference quality.

    Conclusion

    OREO provides a practical and effective solution for enhancing multi-step reasoning in LLMs through offline RL. By addressing the limitations of existing approaches, it offers a scalable method for improving reasoning capabilities. Its integration of detailed credit assignment, iterative training, and test-time search makes it a versatile tool for addressing complex reasoning challenges. The results demonstrate OREO’s potential for application across a range of domains requiring sophisticated problem-solving, contributing to the evolution of AI systems capable of deeper reasoning.


    Check out the Paper. All credit for this research goes to the researchers of this project.

    The post Meet OREO (Offline REasoning Optimization): An Offline Reinforcement Learning Method for Enhancing LLM Multi-Step Reasoning appeared first on MarkTechPost.
