
    Researchers at Oxford Presented Policy-Guided Diffusion: A Machine Learning Method for Controllable Generation of Synthetic Trajectories in Offline Reinforcement Learning (RL)

    April 16, 2024

    Reinforcement learning (RL) is sample-inefficient, which hinders real-world adoption, and standard online methods struggle most in environments where exploration is risky. Offline RL avoids online data collection by optimizing policies from pre-collected data. However, the distribution shift between the target policy and the collected data creates an out-of-sample problem: the agent must evaluate state-action pairs that the dataset does not cover. The resulting overestimation bias can yield an overly optimistic target policy, so effective offline RL has to address distribution shift.

    Prior research addresses this by explicitly or implicitly regularizing the policy toward the behavior distribution. Another approach learns a single-step world model from the offline dataset and uses it to generate trajectories for the target policy, aiming to mitigate distribution shift. However, the world model can itself generalize poorly, which may exacerbate value overestimation bias in RL policies.

    Researchers from the University of Oxford present policy-guided diffusion (PGD), which addresses compounding error in offline RL by modeling entire trajectories rather than single-step transitions. PGD trains a diffusion model on the offline dataset to generate synthetic trajectories under the behavior policy. To align these trajectories with the target policy, guidance from the target policy is applied to shift the sampling distribution. The result is a behavior-regularized target distribution, which reduces divergence from the behavior policy and limits generalization error.
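One way to write the behavior-regularized target distribution described above is sketched below. The notation is an assumption introduced here for illustration, not taken from the article: $\tau = (s_0, a_0, \dots, s_H)$ is a trajectory, $p_{\mathcal{D}}$ is the behavior trajectory distribution approximated by the diffusion model, $\pi_{\text{target}}$ is the target policy, and $\lambda \ge 0$ is a guidance coefficient.

```latex
\tilde{p}_{\lambda}(\tau) \;\propto\; p_{\mathcal{D}}(\tau)\,
  \prod_{t=0}^{H-1} \pi_{\text{target}}(a_t \mid s_t)^{\lambda}
\qquad\Longrightarrow\qquad
\nabla_{\tau} \log \tilde{p}_{\lambda}(\tau)
  = \nabla_{\tau} \log p_{\mathcal{D}}(\tau)
  + \lambda \sum_{t=0}^{H-1} \nabla_{\tau} \log \pi_{\text{target}}(a_t \mid s_t)
```

Under this reading, $\lambda = 0$ recovers plain sampling from the behavior distribution, and larger $\lambda$ places more weight on actions the target policy considers likely.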

    PGD uses a trajectory-level diffusion model trained on the offline dataset to approximate the behavior distribution. Inspired by classifier-guided diffusion, PGD adds guidance from the target policy during the denoising process to steer trajectory sampling toward the target distribution. The sampled trajectories therefore balance action likelihoods under both policies. The guidance term uses only the target policy; PGD omits the corresponding behavior-policy guidance term. To control guidance strength, PGD introduces a guidance coefficient, which tunes how strongly samples are regularized toward the behavior distribution. PGD also applies a cosine guidance schedule and stabilization techniques to improve guidance stability and reduce dynamics error.
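The guidance mechanism can be illustrated with a one-dimensional toy, shown below as a minimal sketch rather than the authors' implementation. It makes several assumptions that are not in the article: analytic Gaussian scores stand in for the learned diffusion model and the target policy, a deterministic score-following loop stands in for the denoiser, and the cosine schedule (zero guidance at the first step, full guidance at the last) is a hypothetical form. All function and parameter names are invented for illustration.

```python
import math


def cosine_guidance_weight(step, num_steps):
    """Hypothetical cosine schedule: 0 at the first step, 1 at the last."""
    progress = step / max(num_steps - 1, 1)
    return 0.5 * (1.0 - math.cos(math.pi * progress))


def gaussian_score(action, mean, std):
    """Gradient of log N(action; mean, std^2) with respect to action."""
    return (mean - action) / (std ** 2)


def guided_denoise(action, behavior_mean, target_mean, std,
                   guidance_coef, num_steps=50, step_size=0.1):
    """Follow the behavior score plus a scaled target-policy score.

    The behavior score is a stand-in for the diffusion model's denoising
    direction; the target score is the policy guidance term.
    """
    for step in range(num_steps):
        behavior_score = gaussian_score(action, behavior_mean, std)
        target_score = gaussian_score(action, target_mean, std)
        weight = guidance_coef * cosine_guidance_weight(step, num_steps)
        action = action + step_size * (behavior_score + weight * target_score)
    return action


if __name__ == "__main__":
    # Behavior data is centered at 0.0; the target policy prefers 2.0.
    for coef in (0.0, 1.0, 4.0):
        sample = guided_denoise(0.0, 0.0, 2.0, 1.0, coef)
        print(f"guidance_coef={coef}: action={sample:.3f}")
```

In this toy, a coefficient of zero leaves the sample at the behavior mean, and larger coefficients pull it further toward the target policy's preferred action without ever reaching it, which mirrors the behavior-regularized trade-off the paragraph describes.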

    The experiments conducted demonstrate the following key findings:

    Effectiveness of PGD: Agents trained with synthetic experience from PGD outperform those trained on unguided synthetic data or directly on the offline dataset.

    Guidance Coefficient Tuning: Tuning the guidance coefficient in PGD enables the sampling of trajectories with high action likelihood across a range of target policies. As the guidance coefficient increases, trajectory likelihood under each target policy increases monotonically, indicating the ability to sample high-probability trajectories with out-of-distribution (OOD) target policies.

    Low Dynamics Error: Despite sampling high-likelihood actions from the policy, PGD retains low dynamics error. Compared to an autoregressive world model (PETS), PGD achieves significantly lower error across all target policies, highlighting its robustness to different target policies.

    Training Stability: Periodic generation of synthetic data outperforms continuous generation, which is attributed to better training stability, especially when guidance is applied early in training. Both approaches consistently outperform training on real and unguided synthetic data, demonstrating the potential of PGD as an extension to replay and model-based RL methods.
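The two diagnostics named in the findings above, trajectory likelihood under the target policy and dynamics error, can be sketched as follows. This is an illustrative reading, not the paper's evaluation code: it assumes scalar states and actions, a Gaussian target policy with fixed standard deviation, and access to a ground-truth transition function. The names `trajectory_log_likelihood`, `dynamics_error`, `policy_mean`, and `true_step` are hypothetical.

```python
import math


def gaussian_log_prob(x, mean, std):
    """Log density of N(mean, std^2) evaluated at x."""
    return (-0.5 * ((x - mean) / std) ** 2
            - math.log(std) - 0.5 * math.log(2.0 * math.pi))


def trajectory_log_likelihood(states, actions, policy_mean, policy_std):
    """Sum of log pi(a_t | s_t) over the actions of one trajectory."""
    return sum(gaussian_log_prob(a, policy_mean(s), policy_std)
               for s, a in zip(states, actions))


def dynamics_error(states, actions, true_step):
    """Mean squared gap between generated next states and true dynamics."""
    errors = [(true_step(s, a) - s_next) ** 2
              for s, a, s_next in zip(states[:-1], actions, states[1:])]
    return sum(errors) / len(errors)


if __name__ == "__main__":
    true_step = lambda s, a: s + a   # assumed toy dynamics
    policy_mean = lambda s: -s       # assumed target policy: push state to zero

    # A synthetic trajectory with H actions and H + 1 states.
    states, actions = [1.0, 0.0, 0.0], [-1.0, 0.0]
    print(trajectory_log_likelihood(states, actions, policy_mean, 1.0))
    print(dynamics_error(states, actions, true_step))
```

A generated trajectory scores well when its actions sit near the target policy's mean (high likelihood) and its state transitions agree with the true dynamics (low error); the article's claim is that PGD can raise the first without inflating the second.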

    To conclude, Oxford researchers introduced PGD, offering a controllable method for synthetic trajectory generation in offline RL. By directly modeling trajectories and utilizing policy guidance, PGD achieves competitive performance compared to autoregressive methods like PETS, with lower dynamics error. This approach consistently improves downstream agent performance across diverse environments and behavior policies. PGD addresses out-of-sample issues, paving the way for less conservative algorithms in offline RL with the potential for further enhancements.


    The post Researchers at Oxford Presented Policy-Guided Diffusion: A Machine Learning Method for Controllable Generation of Synthetic Trajectories in Offline Reinforcement Learning RL appeared first on MarkTechPost.
