Reinforcement Learning (RL) is now applied across nearly every pursuit in science and technology, either as a core methodology or as a way to optimize existing processes and systems. Despite this broad adoption, even in highly advanced fields, RL still lags in some fundamental skills. Sample inefficiency is one such problem that limits its potential: RL typically needs thousands of episodes to learn reasonably basic tasks, such as exploration, that humans master in just a few attempts (imagine a child only figuring out basic arithmetic in high school). Meta-RL circumvents this problem by equipping an agent with prior experience. The agent remembers the events of previous episodes and uses them to adapt to new environments, achieving far better sample efficiency. Meta-RL improves on standard RL because it learns how to explore and can acquire strategies well beyond the reach of standard RL, such as learning new skills or conducting experiments to understand the current environment.
Having discussed the strengths of memory-based Meta-RL, let us discuss what limits it. Traditional Meta-RL approaches aim to maximize the cumulative reward across all episodes in the sequence under consideration, which means the agent must find an optimal balance between exploration and exploitation. Generally, this balance means prioritizing exploration in early episodes so that later episodes can exploit what was learned. The problem is that even state-of-the-art methods get stuck in local optima while exploring, especially when an agent must sacrifice immediate reward in pursuit of higher reward later. In this article, we discuss a recent study that claims to remove this problem from Meta-RL.
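To make this concrete, one common way to write the standard memory-based meta-RL objective (the notation here is a generic sketch, not the paper's exact formulation) is as a single expected cumulative return over the K episodes experienced in a task drawn from the task distribution:

```latex
\max_{\theta}\;
\mathbb{E}_{\mathcal{M}\sim p(\mathcal{M})}
\left[\sum_{k=1}^{K} R_k\big(\pi_\theta \mid c_{1:k-1}\big)\right]
```

Here, \(\mathcal{M}\) is a sampled task, \(R_k\) is the return of the k-th episode, and \(c_{1:k-1}\) is the remembered context of earlier episodes (actions, rewards, observations). Because every episode's return enters the same sum, time spent exploring at low reward directly lowers the objective, which is exactly the pressure that pushes agents toward myopic local optima.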
Researchers at the University of British Columbia presented "First-Explore, Then Exploit," a Meta-RL approach that separates exploration and exploitation by learning two distinct policies. The explore policy gathers information that informs the exploit policy, which maximizes episode return; neither policy attempts to maximize cumulative reward on its own, but the two are combined after training to do so. Because the exploration policy is trained solely to inform the exploit policy, poor current exploitation no longer produces low immediate rewards that discourage exploration. The explore policy performs successive episodes conditioned on the context of the current exploration sequence, which includes previous actions, rewards, and observations. It is incentivized to produce episodes that, when added to the current context, result in subsequent high-return exploit-policy episodes. The exploit policy then takes the context produced by the explore policy and rolls out n high-return episodes.
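The minimal Python sketch below illustrates this rollout structure. The names (explore_policy, exploit_policy, env.reset, env.step, context) are hypothetical placeholders rather than the authors' API; the sketch only shows how explore episodes build a shared context that exploit episodes then condition on.

```python
def run_episode(policy, env, context):
    """Roll out one episode, conditioning the policy on the cross-episode context."""
    obs, done, episode = env.reset(), False, []
    total_return = 0.0
    while not done:
        action = policy.act(obs, context + episode)   # policy sees past episodes + current one
        next_obs, reward, done, _ = env.step(action)
        episode.append((obs, action, reward))
        total_return += reward
        obs = next_obs
    return episode, total_return


def first_explore_then_exploit(explore_policy, exploit_policy, env, k_explore, n_exploit):
    """Post-training combination: k explore episodes to gather information,
    then n exploit episodes that condition on the gathered context."""
    context = []
    for _ in range(k_explore):
        # Explore: these episodes' returns are not the goal; they exist to enrich the context.
        episode, _ = run_episode(explore_policy, env, context)
        context += episode
    exploit_returns = []
    for _ in range(n_exploit):
        # Exploit: maximize per-episode return given the accumulated context.
        _, ep_return = run_episode(exploit_policy, env, context)
        exploit_returns.append(ep_return)
    return exploit_returns
```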
The official implementation of First-Explore uses a GPT-2-style causal transformer architecture. The two policies share the same transformer parameters and differ only in the final-layer head.
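A rough sketch of what such a shared-backbone, two-head design can look like is shown below in PyTorch. The layer sizes, the use of nn.TransformerEncoder, and the head names are illustrative assumptions, not the paper's configuration; the point is simply that one causal transformer encodes the cross-episode context while two separate linear heads produce the explore and exploit action logits.

```python
import torch
import torch.nn as nn

class TwoHeadCausalTransformer(nn.Module):
    """Shared GPT-2-style causal backbone with separate explore/exploit heads.
    Sizes and layer counts are illustrative assumptions, not the paper's config."""

    def __init__(self, token_dim, n_actions, d_model=128, n_heads=4, n_layers=4):
        super().__init__()
        self.embed = nn.Linear(token_dim, d_model)          # embed (obs, action, reward) tokens
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=n_heads, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.explore_head = nn.Linear(d_model, n_actions)   # only the heads differ
        self.exploit_head = nn.Linear(d_model, n_actions)

    def forward(self, tokens):
        # tokens: (batch, seq_len, token_dim) -- the flattened cross-episode context
        seq_len = tokens.size(1)
        causal_mask = nn.Transformer.generate_square_subsequent_mask(seq_len).to(tokens.device)
        h = self.backbone(self.embed(tokens), mask=causal_mask)
        return self.explore_head(h), self.exploit_head(h)   # logits for both policies
```

At rollout time, the same forward pass yields both sets of logits, and only the head matching the current phase (explore or exploit) is read out, so the two policies share their learned representation while keeping separate action distributions.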
For experimentation, the authors evaluated First-Explore on three RL environments of varying difficulty: Bandits with One Fixed Arm, Dark Treasure Rooms, and Ray Maze. The One Fixed Arm bandit is a multi-armed bandit problem designed so that exploration requires forgoing immediate reward: the fixed arm provides immediate reward but has no exploratory value. The second domain is a grid-world environment in which an agent that cannot see its surroundings searches for randomly positioned rewards. The final environment is the most challenging of the three and also highlights First-Explore's learning capabilities beyond Meta-RL: it consists of randomly generated mazes with three reward positions.
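For intuition, the toy sketch below builds a bandit in the spirit of the one-fixed-arm setup. The specific values (a deterministic fixed arm paying 0.5, Gaussian payoffs for the other arms) are illustrative assumptions rather than the paper's parameters; the point is that always pulling the fixed arm is the myopic local optimum, while finding the best arm requires a stretch of lower-reward exploration.

```python
import numpy as np

class OneFixedArmBandit:
    """Toy bandit with one deterministic 'fixed' arm; other arms have unknown means.
    All parameter values are illustrative, not taken from the paper."""

    def __init__(self, n_arms=10, fixed_reward=0.5, seed=None):
        self.rng = np.random.default_rng(seed)
        self.means = self.rng.normal(0.0, 1.0, size=n_arms)  # unknown means of exploratory arms
        self.fixed_arm = n_arms                               # extra arm with a fixed payoff
        self.fixed_reward = fixed_reward

    def pull(self, arm):
        if arm == self.fixed_arm:
            return self.fixed_reward                          # safe immediate reward, zero information
        return self.rng.normal(self.means[arm], 1.0)          # noisy reward, but informative


bandit = OneFixedArmBandit(seed=0)

# Myopic strategy: always take the guaranteed payoff.
greedy_total = sum(bandit.pull(bandit.fixed_arm) for _ in range(100))

# Explore-then-exploit strategy: spend 50 pulls estimating the arms, then commit.
explore_total, estimates = 0.0, []
for arm in range(10):
    rewards = [bandit.pull(arm) for _ in range(5)]            # exploratory pulls, low expected reward
    explore_total += sum(rewards)
    estimates.append(np.mean(rewards))
best_arm = int(np.argmax(estimates))
explore_total += sum(bandit.pull(best_arm) for _ in range(50))

print(f"greedy: {greedy_total:.1f}   explore-then-exploit: {explore_total:.1f}")
```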
In the Fixed Arm Bandit domain, First-Explore achieved twice the total reward of existing meta-RL approaches. The margin grew to 10 times on the second environment and 6 times on the last. Beyond meta-RL approaches, First-Explore also substantially outperformed other RL methods when forgoing immediate reward was required.
Conclusion: First-Explore offers an effective solution to the immediate-reward problem that plagues traditional meta-RL approaches. It bifurcates exploration and exploitation into two independent policies that, when combined post-training, maximize cumulative reward, something meta-RL was previously unable to achieve regardless of the training method. However, the approach still faces challenges that pave the way for future research, among them the inability to explore the future, the disregard for negative rewards, and the demands of long-sequence modeling. It will be interesting to see how these problems are resolved and whether the solutions improve the efficiency of RL in general.
Check out the Paper. All credit for this research goes to the researchers of this project.