UBC Researchers Introduce ‘First Explore’: A Two-Policy Learning Approach to Rescue Meta-Reinforcement Learning (RL) from Failed Explorations

December 17, 2024

Reinforcement learning (RL) is now applied across nearly every area of science and engineering, either as a core methodology or to optimize existing processes and systems. Despite this broad adoption, RL still lags in some fundamental respects. Sample inefficiency is one such limitation: RL typically needs thousands of episodes to learn tasks, such as basic exploration, that humans master in just a few attempts (imagine a child only figuring out basic arithmetic in high school). Meta-RL addresses this problem by equipping the agent with prior experience. The agent carries a memory of previous episodes and uses it to adapt to new environments, achieving sample efficiency. Memory-based meta-RL can learn to explore and can acquire strategies far beyond the reach of standard RL, such as learning new skills or running experiments to understand the current environment.

Given those strengths, what limits memory-based meta-RL? Traditional meta-RL approaches aim to maximize the cumulative reward across all episodes in a sequence, which requires striking a balance between exploration and exploitation; in practice, this means prioritizing exploration in early episodes so that later episodes can exploit what was learned. The problem is that even state-of-the-art methods get stuck in local optima while exploring, especially when the agent must sacrifice immediate reward in pursuit of higher reward later. This article discusses a recent study that claims to remove this problem from meta-RL.
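
To make the tension concrete, here is a toy simulation (a sketch with illustrative numbers, not taken from the paper): sacrificing a sure payout in a few early episodes lowers immediate reward but raises cumulative reward over the sequence, which is exactly the trade-off a purely greedy policy, the local optimum, never makes.

```python
import numpy as np

rng = np.random.default_rng(0)

def cumulative_reward(n_explore_episodes: int, n_episodes: int = 10) -> float:
    """Toy 10-armed bandit, illustrative numbers only (not the paper's).

    Arm 0 always pays a known 0.5. The other arms have unknown means drawn
    from U(0, 1); pulling one is the only way to learn its value, and that
    pull forgoes the sure 0.5 from arm 0.
    """
    unknown_means = rng.uniform(0.0, 1.0, size=9)
    total, best_known = 0.0, 0.5          # arm 0 is the safe default
    for ep in range(n_episodes):
        if ep < n_explore_episodes:       # explore: give up the sure 0.5
            r = rng.normal(unknown_means[ep], 0.1)
            best_known = max(best_known, r)
        else:                             # exploit: pull the best arm seen so far
            r = best_known
        total += r
    return total

for k in (0, 1, 3, 5):
    avg = np.mean([cumulative_reward(k) for _ in range(2000)])
    print(f"explore for {k} episodes -> avg cumulative reward: {avg:.2f}")
```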

Researchers at the University of British Columbia presented “First-Explore, Then Exploit,” a meta-RL approach that separates exploration from exploitation by learning two distinct policies. The explore policy gathers information that informs the exploit policy, which maximizes per-episode return; neither policy attempts to maximize cumulative reward on its own, but the two are combined after training to do so. Because the explore policy is trained solely to inform the exploit policy, poor current exploitation no longer produces immediate rewards that discourage exploration. Concretely, the explore policy runs a sequence of episodes, conditioning on the context of the exploration so far, including its previous actions, rewards, and observations, and is incentivized to produce episodes that, once added to that context, lead to high-return exploit-policy episodes. The exploit policy then conditions on the context produced by the explore policy and runs n episodes aiming for the highest possible return.
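
A minimal sketch of this control flow follows. The toy bandit, the hand-written placeholder policies, and the episode counts are all assumptions for illustration; in the actual method both policies are learned networks, and the explore policy is rewarded only through the exploit returns its episodes enable.

```python
import random

N_ARMS = 5

class ToyBandit:
    """A bandit whose arm means are resampled for every meta-episode."""
    def __init__(self):
        self.means = [random.random() for _ in range(N_ARMS)]
    def pull(self, arm):
        return self.means[arm] + random.gauss(0.0, 0.1)

def explore_policy(context):
    # Placeholder heuristic: pull the least-tried arm so each explore
    # episode adds new information to the context.
    counts = [0] * N_ARMS
    for arm, _ in context:
        counts[arm] += 1
    return counts.index(min(counts))

def exploit_policy(context):
    # Placeholder heuristic: greedily pick the arm with the best observed mean.
    sums, counts = [0.0] * N_ARMS, [0] * N_ARMS
    for arm, r in context:
        sums[arm] += r
        counts[arm] += 1
    means = [s / c if c else float("-inf") for s, c in zip(sums, counts)]
    return means.index(max(means))

def meta_episode(n_explore=5, n_exploit=5):
    env, context = ToyBandit(), []
    # Phase 1: explore episodes (single pulls here) are appended to the
    # context; in First-Explore they are rewarded only through the exploit
    # returns they later enable, never through their own returns.
    for _ in range(n_explore):
        arm = explore_policy(context)
        context.append((arm, env.pull(arm)))
    # Phase 2: exploit episodes condition on that context and maximize
    # their own per-episode return.
    return sum(env.pull(exploit_policy(context)) for _ in range(n_exploit))

print(sum(meta_episode() for _ in range(200)) / 200)
```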

The official implementation of First-Explore uses a GPT-2-style causal transformer. The two policies share their parameters and differ only in the final-layer head.
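
As a rough sketch of what such an architecture might look like (the dimensions, input encoding, and masking details below are assumptions, not the paper's exact configuration), a shared causal-transformer trunk with two output heads could be written in PyTorch as:

```python
import torch
import torch.nn as nn

class TwoHeadPolicy(nn.Module):
    """Shared causal-transformer trunk with separate explore/exploit heads.
    Sizes and the input encoding are illustrative, not the paper's config."""
    def __init__(self, obs_dim=16, n_actions=5, d_model=64, n_layers=2):
        super().__init__()
        self.embed = nn.Linear(obs_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.trunk = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.explore_head = nn.Linear(d_model, n_actions)  # the two policies
        self.exploit_head = nn.Linear(d_model, n_actions)  # differ only here

    def forward(self, context, mode="explore"):
        # context: (batch, seq_len, obs_dim) encoding past observations,
        # actions, and rewards of the current meta-episode.
        seq_len = context.size(1)
        causal_mask = nn.Transformer.generate_square_subsequent_mask(seq_len)
        h = self.trunk(self.embed(context), mask=causal_mask)
        head = self.explore_head if mode == "explore" else self.exploit_head
        return head(h[:, -1])  # action logits for the latest timestep

logits = TwoHeadPolicy()(torch.randn(1, 10, 16), mode="exploit")
print(logits.shape)  # torch.Size([1, 5])
```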

For experimentation, the authors evaluated First-Explore in three RL environments of varying difficulty: Bandits with One Fixed Arm, Dark Treasure Rooms, and Ray Maze. The fixed-arm bandit is a multi-armed bandit problem in which one arm always pays a known immediate reward but offers no exploratory value, so effective exploration requires forgoing that reward. The second domain, Dark Treasure Rooms, is a grid-world environment in which an agent that cannot see its surroundings searches for randomly positioned rewards. The final environment, Ray Maze, is the most challenging of the three and highlights First-Explore's learning capabilities beyond typical meta-RL settings: it consists of randomly generated mazes containing three reward locations.
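
For a sense of the structure of the first domain, here is an illustrative one-fixed-arm bandit; the arm count and reward values are assumptions rather than the paper's settings:

```python
import random

class OneFixedArmBandit:
    """Illustrative fixed-arm bandit: arm 0 always pays a known, middling
    reward, so any exploratory pull of another arm means giving up that
    sure payout. The arm count and reward values here are assumptions."""
    def __init__(self, n_arms=10, fixed_reward=0.5):
        self.fixed_reward = fixed_reward
        self.means = [random.uniform(0.0, 1.0) for _ in range(n_arms - 1)]

    def pull(self, arm):
        if arm == 0:
            return self.fixed_reward      # known reward, no information gained
        return random.gauss(self.means[arm - 1], 0.1)

env = OneFixedArmBandit()
print(env.pull(0), env.pull(3))
```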

First-Explore achieved twice the cumulative reward of standard meta-RL approaches on the fixed-arm bandit, roughly ten times on Dark Treasure Rooms, and six times on Ray Maze. Beyond meta-RL baselines, First-Explore also substantially outperformed other RL methods whenever exploration required forgoing immediate reward.

Conclusion: First-Explore offers an effective solution to the immediate-reward problem that plagues traditional meta-RL approaches. It separates exploration and exploitation into two independent policies that, combined after training, maximize cumulative reward, something standard meta-RL could not achieve regardless of training method. The approach still faces challenges that point toward future research, including exploration that does not anticipate future episodes, disregard for negative rewards during exploration, and the difficulty of long-sequence modeling. It will be interesting to see how these problems are resolved and whether doing so improves the efficiency of RL more broadly.


Check out the Paper. All credit for this research goes to the researchers of this project.

Source: MarkTechPost
