Large language models (LLMs) have shown exceptional capabilities in comprehending human language, reasoning, and acquiring knowledge, suggesting their potential to serve as autonomous agents. However, training high-performance web agents based on open LLMs within online environments such as WebArena faces several critical challenges. The first is the insufficient supply of predefined training tasks in online benchmarks. The second is the difficulty of assessing success for arbitrary web-browsing tasks, since feedback signals are sparse and costly to obtain. Lastly, the absence of a predefined training set necessitates online exploration, which leads to policy distribution drift and potential catastrophic forgetting that can degrade the agent's performance over time.
Existing work falls into two broad lines: adopting LLMs as agents and applying reinforcement learning (RL) to LLMs. Research on LLMs as agents spans training-free and training-based approaches. While some studies use powerful LLMs such as GPT-4 to generate demonstrations, the resulting accuracy remains insufficient for complex tasks. To address this, researchers have explored RL techniques, which frame device control and interaction with complex environments as sequential decision-making. RL-based methods such as AgentQ, which uses DPO for policy updates, and actor-critic architectures have shown promise in complex device-control tasks. However, feedback signals in web-based tasks remain limited and sparse, often reduced to a binary success or failure delivered only after many rounds of interaction.
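To illustrate why this is hard for RL, the short Python snippet below shows a purely outcome-based reward: only the final step of a multi-step episode carries any signal. It is a toy illustration, not code from any of the cited systems.

```python
from dataclasses import dataclass

@dataclass
class Step:
    observation: str
    action: str

def outcome_reward(trajectory: list[Step], task_succeeded: bool) -> list[float]:
    """Assign a single terminal reward to a whole episode: every intermediate
    step gets 0.0, which is what makes the learning signal sparse."""
    rewards = [0.0] * len(trajectory)
    if trajectory:
        rewards[-1] = 1.0 if task_succeeded else 0.0
    return rewards

# A three-step web episode that ultimately failed yields no learning signal at all.
episode = [Step("search page", "type query"),
           Step("results page", "click first link"),
           Step("wrong page", "stop")]
print(outcome_reward(episode, task_succeeded=False))  # [0.0, 0.0, 0.0]
```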
Researchers from Tsinghua University and Zhipu AI have proposed WEBRL, a self-evolving online curriculum RL framework designed to train high-performance web agents using open LLMs. It addresses the key challenges in building LLM web agents, including the scarcity of training tasks, sparse feedback signals, and policy distribution drift in online learning. Moreover, it utilizes three key components:
- A self-evolving curriculum that generates new tasks from unsuccessful attempts.
- A robust outcome-supervised reward model (ORM) for evaluating task success.
- Adaptive RL strategies to ensure consistent improvements.
In this way, WEBRL bridges the gap between open and proprietary LLM-based web agents, opening a path toward more accessible and powerful autonomous web interaction systems.
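To make the interplay of these components more concrete, here is a minimal, hypothetical Python sketch of one curriculum phase. The function names (`rollout`, `judge_outcome`, `generate_variants`, `update_policy`) and the toy stand-ins are illustrative assumptions, not WEBRL's actual API.

```python
import random

def run_curriculum_phase(tasks, rollout, judge_outcome, generate_variants, update_policy):
    """One phase of a self-evolving curriculum (illustrative sketch).

    rollout(task)             -> trajectory (agent interacts with the web environment)
    judge_outcome(task, traj) -> bool       (an ORM labels the outcome as success/failure)
    generate_variants(task)   -> list[str]  (new tasks derived from a failed attempt)
    update_policy(batch)      -> None       (RL update on the labeled rollouts)
    """
    labeled, next_tasks = [], []
    for task in tasks:
        traj = rollout(task)
        success = judge_outcome(task, traj)
        labeled.append((task, traj, success))
        if not success:
            # Failed tasks seed the next phase, keeping difficulty matched to ability.
            next_tasks.extend(generate_variants(task))
    update_policy(labeled)
    return next_tasks or tasks  # reuse current tasks if no new ones were generated

# Toy stand-ins so the sketch runs end to end; a real system would plug in an
# LLM agent, an outcome-supervised reward model, and a task-generation LLM.
tasks = ["find the cheapest laptop", "open the latest GitLab issue"]
for phase in range(3):
    tasks = run_curriculum_phase(
        tasks,
        rollout=lambda t: [f"navigate for: {t}"],
        judge_outcome=lambda t, traj: random.random() > 0.5,
        generate_variants=lambda t: [f"{t} (harder variant)"],
        update_policy=lambda batch: None,
    )
    print(f"phase {phase}: {len(tasks)} tasks queued for the next phase")
```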
WEBRL utilizes a self-evolving online curriculum that harnesses the trial-and-error process inherent in exploration to address the scarcity of web agent training tasks. In each training phase, WEBRL autonomously generates novel tasks from unsuccessful attempts in the preceding phase, providing a progressive learning trajectory. It also incorporates a KL-divergence term between the reference and actor policies into its learning algorithm to reduce the policy distribution shift induced by curriculum-based RL. This constraint on policy updates promotes stability and prevents catastrophic forgetting. In addition, WEBRL implements an experience replay buffer augmented with a novel actor confidence filtering strategy, allowing it to reuse past experiences as the policy evolves.
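A rough sketch of how such a KL-regularized update and a confidence-based replay filter might look is shown below. The loss form, the thresholds, and the filtering rule are illustrative assumptions based on the description above, not the exact objective from the WEBRL paper.

```python
import torch

def kl_regularized_policy_loss(logp_actor, logp_ref, advantages, beta=0.1):
    """Policy-gradient loss with a KL penalty toward a frozen reference policy.

    logp_actor, logp_ref: log-probabilities of the taken actions under the current
    and reference policies (shape: [batch]); advantages: advantage estimates.
    beta weights the KL term that discourages drift away from the reference policy.
    """
    pg_loss = -(advantages * logp_actor).mean()   # standard policy-gradient term
    kl_penalty = (logp_actor - logp_ref).mean()   # sample-based estimate of KL(actor || ref)
    return pg_loss + beta * kl_penalty

def filter_replay_by_confidence(buffer, logp_actor, low=-3.0, high=-0.05):
    """Keep replayed samples whose actor log-probability falls in a mid-confidence band,
    dropping ones the policy already reproduces trivially or has drifted far from.
    The thresholds here are placeholders, not values from the paper.
    """
    keep = (logp_actor > low) & (logp_actor < high)
    return [sample for sample, k in zip(buffer, keep.tolist()) if k]

# Toy usage with random tensors standing in for per-sample model outputs.
logp_actor = torch.randn(4) - 1.0
logp_ref = torch.randn(4) - 1.0
advantages = torch.randn(4)
print(kl_regularized_policy_loss(logp_actor, logp_ref, advantages))
print(filter_replay_by_confidence(["a", "b", "c", "d"], logp_actor))
```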
Llama-3.1-8B trained with WEBRL achieves an average accuracy of 42.4%, surpassing all baseline approaches, including both prompting-based and training-based alternatives. WEBRL excels on specific sites such as GitLab (46.7%) and CMS (54.3%), showcasing its ability to handle complex web tasks effectively. It also outperforms imitation-learning methods such as SFT and Filtered BC, and it consistently beats DigiRL, a previous state-of-the-art method that performs policy updates on a predefined, fixed set of tasks, which may not align with the model's current skill level. WEBRL addresses this with self-evolving curriculum learning, adjusting task complexity to the model's abilities, promoting wider exploration, and supporting continuous improvement.
In conclusion, the researchers have introduced WEBRL, a novel self-evolving online curriculum RL framework for training LLM-based web agents. It addresses the critical challenges in building effective LLM web agents, including the scarcity of training tasks, the sparsity of feedback signals, and policy distribution drift in online learning. The results demonstrate that WEBRL enables LLM-based web agents to outperform existing state-of-the-art approaches, including proprietary LLM APIs. These findings enhance the capabilities of open-source LLMs for web-based tasks, paving the way for more accessible and powerful autonomous web interaction systems. The successful application of WEBRL across different LLM architectures, such as Llama-3.1 and GLM-4, validates the robustness and adaptability of the proposed framework.
Check out the Paper. All credit for this research goes to the researchers of this project.