
    Off-Policy Reinforcement Learning (RL) with KL Divergence Yields Superior Reasoning in Large Language Models

    June 2, 2025

    Policy gradient methods have significantly advanced the reasoning capabilities of large language models (LLMs), particularly when applied through reinforcement learning (RL). A key tool for stabilizing these methods is Kullback-Leibler (KL) regularization, which discourages drastic changes between the current policy and a reference policy. While KL regularization is widely used in algorithms such as PPO, there is still much to explore in how different KL variants, such as forward KL, reverse KL, and their unnormalized forms, can be estimated and applied within loss functions. These choices, along with the gradient estimator and the on-policy versus off-policy setting, shape training stability and performance in nuanced and underexplored ways.
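
    To make these estimation choices concrete, the sketch below shows simple Monte Carlo estimates of the reverse and forward KL penalties computed from responses sampled by the current policy. The direction conventions, the "k3" variant, and the PyTorch interface are illustrative assumptions, not the specific estimators analyzed in the paper.

    ```python
    import torch

    def kl_penalty_estimates(logp_cur: torch.Tensor, logp_ref: torch.Tensor):
        """Per-sample KL estimates between the current policy pi_theta and a
        frozen reference pi_ref, from responses sampled under pi_theta.

        Naming assumption: reverse KL = KL(pi_theta || pi_ref),
        forward KL = KL(pi_ref || pi_theta).
        """
        log_ratio = logp_ref - logp_cur                  # log(pi_ref / pi_theta)

        # Reverse KL: naive single-sample estimate ...
        reverse_kl = -log_ratio
        # ... and a non-negative, lower-variance variant (the "k3" form).
        reverse_kl_k3 = torch.exp(log_ratio) - log_ratio - 1.0

        # Forward KL: needs an importance weight because the samples come
        # from pi_theta rather than pi_ref.
        forward_kl = torch.exp(log_ratio) * log_ratio

        return reverse_kl, reverse_kl_k3, forward_kl
    ```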

    Fine-tuning LLMs with human feedback is crucial for building aligned AI systems. Two main strategies are employed: optimizing with reward models using policy gradient methods, such as PPO, and directly training on human preferences through methods like Direct Preference Optimization (DPO). While PPO stabilizes training with reward models, DPO and its variants use pairwise comparisons to simplify and scale learning, gaining popularity in recent models. Reinforcement learning is also increasingly used to enhance LLM reasoning, especially in complex tasks like math and coding. New methods aim to reduce computational costs and improve training stability, often by replacing value networks or modifying KL penalties. 
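
    For readers less familiar with the preference-based route, here is a minimal sketch of the DPO objective on a batch of preference pairs; the precomputed sequence log-probabilities and the beta coefficient are assumptions for illustration.

    ```python
    import torch.nn.functional as F

    def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
        """Minimal DPO loss sketch: maximize the implicit-reward margin of the
        chosen response over the rejected one, measured against a frozen
        reference policy. beta=0.1 is an illustrative default."""
        chosen_margin = logp_chosen - ref_logp_chosen          # log pi_theta/pi_ref (chosen)
        rejected_margin = logp_rejected - ref_logp_rejected    # log pi_theta/pi_ref (rejected)
        return -F.logsigmoid(beta * (chosen_margin - rejected_margin)).mean()
    ```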

    Researchers from UCLA, Tsinghua University, and the Shanghai Qi Zhi Institute introduce Regularized Policy Gradient (RPG), a unified framework for KL-regularized policy gradients in online reinforcement learning. They derive policy gradients and surrogate loss functions using both forward and reverse KL divergences, addressing normalized and unnormalized policies. RPG supports both fully differentiable objectives and REINFORCE-style estimators, tailored for off-policy training with importance sampling. The study also identifies and addresses theoretical issues in existing methods, such as GRPO, and examines KL regularization in REINFORCE++. Experiments on LLM reasoning tasks demonstrate that RPG achieves improved stability and performance compared to leading baselines, including GRPO, REINFORCE++, and DAPO.

    The study presents policy gradient methods that incorporate KL-divergence regularization in an online, off-policy setting, using importance sampling from an older policy. For forward KL, the gradient involves importance-weighted rewards and a regularization term, and its loss reduces to the maximum-likelihood loss when the rewards are zero. The unnormalized forward KL adds a correction for mismatched distribution masses. Similarly, reverse KL and its unnormalized form penalize deviation from the reference policy by modifying the reward with log-probability ratios. All approaches share a REINFORCE-like gradient structure, enabling alternative implementations with the stop-gradient operator, which supports stable and efficient optimization in practice.
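
    As a rough illustration of how these pieces fit together, the sketch below folds a reverse-KL penalty into the reward and applies a detached importance weight for the off-policy correction, yielding the REINFORCE-like form described above. The penalty form, KL direction, and coefficient are assumptions for illustration rather than the paper's exact objective.

    ```python
    import torch

    def off_policy_reverse_kl_pg_loss(logp_cur, logp_old, logp_ref, reward, beta=0.05):
        """REINFORCE-style, off-policy policy-gradient loss with a reverse-KL
        penalty folded into the reward (illustrative sketch).

        logp_cur: log pi_theta(y|x) under the current policy (requires grad)
        logp_old: log pi_old(y|x) under the behavior (sampling) policy
        logp_ref: log pi_ref(y|x) under the frozen reference policy
        reward:   scalar reward per sampled response
        """
        # Detached importance weight: acts as a coefficient (stop-gradient),
        # so the gradient keeps the REINFORCE-like structure.
        iw = torch.exp(logp_cur - logp_old).detach()

        # Reverse-KL shaping: penalize the log-probability ratio to the reference.
        shaped_reward = reward - beta * (logp_cur - logp_ref).detach()

        # Gradient flows only through logp_cur.
        return -(iw * shaped_reward * logp_cur).mean()
    ```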

    The researchers conducted a thorough evaluation of their proposed RPG methods, both differentiable and REINFORCE-style, by comparing them to several established baselines on complex math reasoning tasks using Qwen2.5 language models. They trained on the DAPO-Math-17k dataset and evaluated performance on benchmarks such as AMC23 and AIME. RPG variants consistently demonstrated strong accuracy, training stability, and efficient memory usage. The implementation used the Verl framework together with KL regularization, PPO-style clipping, and the Schedule-Free AdamW optimizer for smoother optimization. RPG models generally outperformed the baselines in training reward, entropy control, and response-length behavior, highlighting their robustness and suitability for stable, high-performance learning.
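
    Of the stabilizers mentioned, PPO-style clipping is the simplest to show in isolation: the importance ratio between the current and behavior policies is clipped before forming the surrogate loss. The clip range and the advantage definition below are illustrative assumptions, not the paper's training configuration.

    ```python
    import torch

    def clipped_surrogate(logp_cur, logp_old, advantage, clip_eps=0.2):
        """PPO-style clipped surrogate over sampled responses (sketch)."""
        ratio = torch.exp(logp_cur - logp_old)                        # importance ratio
        unclipped = ratio * advantage
        clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantage
        # Pessimistic (minimum) objective, negated for minimization.
        return -torch.min(unclipped, clipped).mean()
    ```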

    In conclusion, RPG is a comprehensive framework for designing and analyzing policy gradient methods that incorporate KL regularization in online, off-policy reinforcement learning. The researchers explore a range of configurations, including forward and reverse KL divergences, normalized and unnormalized policy distributions, and two types of estimators: fully differentiable and REINFORCE-style. RPG aims to provide a structured approach to understanding and implementing these variations. Applied to reasoning tasks with large language models, the proposed methods demonstrate more stable training and competitive or improved performance compared to established baselines such as GRPO, REINFORCE++, and DAPO.


    Check out the Paper and GitHub Page. All credit for this research goes to the researchers of this project.

    The post Off-Policy Reinforcement Learning (RL) with KL Divergence Yields Superior Reasoning in Large Language Models appeared first on MarkTechPost.
