
    Meet ONI: A Distributed Architecture for Simultaneous Reinforcement Learning Policy and Intrinsic Reward Learning with LLM Feedback

    December 26, 2024

Reward functions play a crucial role in reinforcement learning (RL) systems, but their design must balance simplicity of task definition against effectiveness of optimization. The conventional approach of binary rewards offers a straightforward task definition but makes optimization difficult because the learning signal is sparse. Intrinsic rewards have emerged as a way to aid policy optimization, but crafting them requires extensive task-specific knowledge and expertise: human experts must carefully balance multiple factors to produce a reward function that accurately represents the desired task while still enabling efficient learning.
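The interplay between sparse and intrinsic rewards can be sketched as below. This is an illustrative toy, not any specific system's formulation: `obs_novelty` and the mixing coefficient `beta` are hypothetical placeholders for whatever dense signal an intrinsic reward supplies.

```python
def shaped_reward(task_done: bool, obs_novelty: float, beta: float = 0.1) -> float:
    """Combine a sparse binary task reward with a dense intrinsic bonus."""
    extrinsic = 1.0 if task_done else 0.0  # sparse: non-zero only at task success
    intrinsic = beta * obs_novelty         # dense: gives a learning signal every step
    return extrinsic + intrinsic

# Without the intrinsic term, every pre-success step returns 0.0 and the
# policy gets no gradient signal; the bonus fills that gap.
print(shaped_reward(False, 0.5))  # 0.05
print(shaped_reward(True, 0.0))   # 1.0
```

The difficulty the paragraph describes lies entirely in choosing the intrinsic term: a poorly chosen bonus can dominate the task reward or steer the agent toward irrelevant behavior.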

    Recent approaches have utilized Large Language Models (LLMs) to automate reward design based on natural language task descriptions, following two main methodologies. The first approach focuses on generating reward function codes through LLMs, which has shown success in continuous control tasks. However, this method faces limitations as it requires access to environment source code or detailed parameter descriptions and struggles with processing high-dimensional state representations. The second approach involves generating reward values directly through LLMs, exemplified by methods like Motif, which ranks observation captions using LLM preferences. However, it requires pre-existing captioned observation datasets and involves a time-consuming three-stage process.

Researchers from Meta, the University of Texas at Austin, and UCLA have proposed ONI, a novel distributed architecture that simultaneously learns RL policies and intrinsic reward functions using LLM feedback. The method uses an asynchronous LLM server to annotate the agent’s collected experiences, which are then distilled into an intrinsic reward model. The approach explores several algorithmic options for reward modeling, including hashing, classification, and ranking models, to investigate their effectiveness in addressing sparse reward problems. This unified methodology achieves superior performance on challenging sparse reward tasks in the NetHack Learning Environment, operating solely on the agent’s gathered experience without requiring external datasets.
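To make the classification variant concrete, here is a minimal sketch of a classification-style intrinsic reward model, using logistic regression as a stand-in for whatever network the real system trains. The feature representation and training hyperparameters are illustrative assumptions: the point is that the model is fit on (observation feature, LLM annotation) pairs and its predicted probability then serves as the intrinsic reward, letting it generalize to observations the LLM never labeled.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_reward_model(X, y, lr=0.5, steps=500):
    """Fit logistic regression on (feature, LLM-label) pairs by gradient descent."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        p = sigmoid(X @ w)
        w -= lr * X.T @ (p - y) / len(y)  # gradient of binary cross-entropy
    return w

def intrinsic_reward(w, x):
    """Probability the LLM would flag observation x as interesting."""
    return sigmoid(x @ w)

# Toy data: 2-d features standing in for caption embeddings, binary LLM labels.
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]])
y = np.array([1.0, 0.0, 1.0, 0.0])
w = train_reward_model(X, y)
# The learned model rewards feature patterns the LLM labeled positively,
# including new observations with similar features.
```

A hashing-style variant would instead look annotations up verbatim, which cannot generalize beyond seen observations; the classification model trades exactness for that generalization.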

ONI is built on the Sample Factory library and its asynchronous variant of proximal policy optimization (APPO). The system runs 480 concurrent environment instances on an A100 80GB GPU with 48 CPUs, achieving approximately 32k environment interactions per second. The architecture incorporates four crucial components: an LLM server on a separate node, an asynchronous process that transmits observation captions to the LLM server via HTTP requests, a hash table that stores captions and their LLM annotations, and a reward model trained dynamically on those annotations. This asynchronous design preserves 80-95% of the original system throughput, processing 30k environment interactions per second without reward model training and 26k when training a classification-based reward model.
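The caption-annotation pipeline described above can be sketched as follows: captions are handed off to a background worker so the RL loop never blocks on the LLM, and a hash table memoizes annotations so each distinct caption is labeled at most once. `annotate_with_llm` is a hypothetical stand-in for the HTTP request to the separate LLM server node.

```python
import queue
import threading

def annotate_with_llm(caption: str) -> int:
    # Placeholder: the real system issues an HTTP request to an LLM server;
    # here we fake a binary "interesting?" annotation.
    return 1 if "door" in caption else 0

class CaptionAnnotator:
    def __init__(self):
        self.cache = {}               # hash table: caption -> LLM annotation
        self.pending = queue.Queue()  # captions awaiting annotation
        threading.Thread(target=self._worker, daemon=True).start()

    def _worker(self):
        # Runs off the training path, so slow LLM calls barely reduce throughput.
        while True:
            caption = self.pending.get()
            if caption not in self.cache:
                self.cache[caption] = annotate_with_llm(caption)
            self.pending.task_done()

    def submit(self, caption: str) -> None:
        # Called from the RL loop; returns immediately.
        self.pending.put(caption)

annotator = CaptionAnnotator()
annotator.submit("You see a door.")
annotator.submit("You see a door.")  # duplicate: served from the hash table
annotator.pending.join()             # wait for the worker (for demo only)
print(annotator.cache)               # {'You see a door.': 1}
```

The memoization matters in NetHack, where the same observation messages recur constantly; without it, LLM query volume would scale with environment steps rather than with distinct captions.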

The experimental results demonstrate significant performance improvements across multiple tasks in the NetHack Learning Environment. While an agent trained on extrinsic reward alone performs adequately on the dense Score task, it fails on sparse reward tasks. ‘ONI-classification’ matches or approaches the performance of existing methods like Motif across most tasks, and does so without pre-collected data or additional dense reward functions. Among the ONI variants, ‘ONI-retrieval’ shows strong performance, while ‘ONI-classification’ consistently improves on it through its ability to generalize to unseen messages. ‘ONI-ranking’ achieves the highest experience levels, while ‘ONI-classification’ leads on the other performance metrics in reward-free settings.

In this paper, the researchers introduced ONI, a significant advancement in RL: a distributed system that simultaneously learns intrinsic rewards and agent behaviors online. It achieves state-of-the-art performance across challenging sparse reward tasks in the NetHack Learning Environment while eliminating the pre-collected datasets and auxiliary dense reward functions that were previously essential. This work lays a foundation for more autonomous intrinsic reward methods that learn exclusively from agent experience, operate independently of external dataset constraints, and integrate effectively with high-performance reinforcement learning systems.


    The post Meet ONI: A Distributed Architecture for Simultaneous Reinforcement Learning Policy and Intrinsic Reward Learning with LLM Feedback appeared first on MarkTechPost.

