    Meet ONI: A Distributed Architecture for Simultaneous Reinforcement Learning Policy and Intrinsic Reward Learning with LLM Feedback

    December 26, 2024

    Reward functions play a crucial role in reinforcement learning (RL) systems, but designing them means trading off a simple task definition against effective optimization. The conventional approach of using binary rewards defines the task in a straightforward way but is hard to optimize because the learning signal is sparse. Intrinsic rewards have emerged as a way to aid policy optimization, but crafting them requires extensive task-specific knowledge. Human experts must carefully balance multiple factors to create reward functions that both represent the desired task accurately and enable efficient learning.
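To make the contrast between a sparse binary reward and an added intrinsic reward concrete, here is a minimal sketch. The weighted-sum form, the `beta` coefficient, and the example reward values are all illustrative assumptions, not details taken from the ONI paper.

```python
def shaped_reward(extrinsic: float, intrinsic: float, beta: float = 0.1) -> float:
    """Add a weighted intrinsic bonus to the environment's own reward.

    `beta` is a hypothetical mixing coefficient chosen for illustration.
    """
    return extrinsic + beta * intrinsic


# A sparse binary task: the environment pays 1.0 only on the final, successful step.
extrinsic_rewards = [0.0, 0.0, 0.0, 1.0]

# Hypothetical intrinsic scores that mark intermediate progress.
intrinsic_rewards = [0.0, 1.0, 0.0, 1.0]

combined = [
    shaped_reward(e, i, beta=0.5)
    for e, i in zip(extrinsic_rewards, intrinsic_rewards)
]
print(combined)
```

With the binary reward alone, the agent receives no signal until the last step; the intrinsic term gives it a nonzero signal earlier in the episode.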

    Recent approaches have used Large Language Models (LLMs) to automate reward design from natural language task descriptions, following two main methodologies. The first generates reward function code with an LLM, which has shown success in continuous control tasks. However, this method requires access to the environment source code or detailed parameter descriptions, and it struggles with high-dimensional state representations. The second generates reward values directly with an LLM, as exemplified by Motif, which ranks observation captions using LLM preferences. However, Motif requires a pre-existing dataset of captioned observations and involves a time-consuming three-stage process.
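One common way to turn pairwise preferences over captions into a scalar reward is a Bradley-Terry style logistic loss. The sketch below shows that loss in isolation; whether Motif uses exactly this form is an assumption here, and the function name and scores are hypothetical.

```python
import math


def preference_loss(score_preferred: float, score_other: float) -> float:
    """Negative log-likelihood that the preferred caption wins.

    Under a Bradley-Terry model, P(preferred wins) = sigmoid(margin),
    so the loss is softplus(-margin), computed here in a numerically stable way.
    """
    margin = score_preferred - score_other
    x = -margin
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


# Hypothetical reward-model scores for two captions.
good_ordering = preference_loss(2.0, 0.0)  # model agrees with the LLM preference
bad_ordering = preference_loss(0.0, 2.0)   # model disagrees with the LLM preference
print(good_ordering < bad_ordering)
```

Minimizing this loss over many annotated pairs pushes the reward model to score the captions the LLM prefers more highly.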

    Researchers from Meta, the University of Texas at Austin, and UCLA have proposed ONI, a distributed architecture that simultaneously learns an RL policy and an intrinsic reward function using LLM feedback. The method uses an asynchronous LLM server to annotate the agent’s collected experience, and the annotations are then turned into an intrinsic reward model. The authors explore several algorithmic choices for reward modeling, including hashing, classification, and ranking models, to investigate how well each addresses sparse reward problems. This unified methodology achieves superior performance on challenging sparse reward tasks in the NetHack Learning Environment, operating solely on the agent’s gathered experience without requiring external datasets.

    ONI is built on the Sample Factory library and its asynchronous variant of proximal policy optimization (APPO). The system runs 480 concurrent environment instances on a Tesla A100-80GB GPU with 48 CPUs, achieving approximately 32k environment interactions per second. The architecture adds four components: an LLM server on a separate node, an asynchronous process that sends observation captions to the LLM server via HTTP requests, a hash table that stores captions and their LLM annotations, and code that trains the reward model dynamically. This asynchronous design retains 80-95% of the original system throughput: 30k environment interactions per second without reward model training, and 26k when training a classification-based reward model.
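The interplay between the asynchronous annotation process and the hash table can be sketched as follows. This is a simplified illustration, not ONI's implementation: `fake_llm_annotate` is a hypothetical stand-in for the HTTP request to the LLM server, the class and method names are invented, and returning 0.0 for captions that have not been annotated yet is an assumption.

```python
import queue
import threading


def fake_llm_annotate(caption: str) -> int:
    """Hypothetical stand-in for the HTTP call to the LLM server (1 = helpful)."""
    return 1 if "key" in caption else 0


class AsyncAnnotator:
    """Caches caption labels in a hash table filled by a background worker."""

    def __init__(self) -> None:
        self.labels = {}               # hash table: caption -> LLM annotation
        self._seen = set()             # captions already queued for annotation
        self._pending = queue.Queue()  # captions waiting for the annotator
        self._lock = threading.Lock()
        threading.Thread(target=self._worker, daemon=True).start()

    def _worker(self) -> None:
        # Runs off the training path, so slow annotation never blocks the agent.
        while True:
            caption = self._pending.get()
            label = fake_llm_annotate(caption)
            with self._lock:
                self.labels[caption] = label
            self._pending.task_done()

    def intrinsic_reward(self, caption: str) -> float:
        """Return the cached label, or 0.0 while the caption is still unlabeled."""
        with self._lock:
            if caption in self.labels:
                return float(self.labels[caption])
            if caption not in self._seen:
                self._seen.add(caption)
                self._pending.put(caption)
        return 0.0

    def wait_until_idle(self) -> None:
        """Block until every queued caption has been annotated."""
        self._pending.join()


annotator = AsyncAnnotator()
first = annotator.intrinsic_reward("You find a key.")   # queued, not labeled yet
annotator.wait_until_idle()
second = annotator.intrinsic_reward("You find a key.")  # served from the hash table
print(first, second)
```

Because the lookup never waits on the annotator, the environment loop keeps running at close to full speed, which is the property the throughput figures above describe.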

    The experimental results show significant performance improvements across multiple tasks in the NetHack Learning Environment. The agent trained on extrinsic reward alone performs adequately on the dense Score task but fails on the sparse reward tasks. ‘ONI-classification’ matches or approaches the performance of existing methods such as Motif on most tasks, and it does so without pre-collected data or additional dense reward functions. Among the ONI variants, ‘ONI-retrieval’ shows strong performance, while ‘ONI-classification’ improves on it consistently because it can generalize to unseen messages. In the reward-free setting, ‘ONI-ranking’ achieves the highest experience levels, while ‘ONI-classification’ leads on the other performance metrics.

    In this paper, the researchers introduce ONI, a distributed system that learns intrinsic rewards and agent behaviors online at the same time. It achieves state-of-the-art performance on challenging sparse reward tasks in the NetHack Learning Environment while eliminating the pre-collected datasets and auxiliary dense reward functions that earlier methods required. The work lays a foundation for more autonomous intrinsic reward methods that learn exclusively from agent experience, do not depend on external datasets, and integrate with high-performance reinforcement learning systems.


    Check out the Paper and GitHub Page. All credit for this research goes to the researchers of this project.

    The post Meet ONI: A Distributed Architecture for Simultaneous Reinforcement Learning Policy and Intrinsic Reward Learning with LLM Feedback appeared first on MarkTechPost.
