This AI Paper Introduces AssistantBench and SeePlanAct: A Benchmark and Agent for Complex Web-Based Tasks

Artificial intelligence (AI) research aims to build systems that can perform tasks that typically require human intelligence. One persistent challenge is creating systems that can handle complex, realistic tasks requiring extensive interaction with dynamic environments. These tasks often involve searching for and synthesizing information from the web, something current models struggle to do accurately and reliably. This gap in capabilities highlights the need for more advanced AI systems.

Existing methods for addressing web-based tasks include closed-book language models (LMs) and retrieval-augmented LMs. Closed-book models rely solely on pre-existing knowledge encoded within their parameters, often resulting in hallucinations where the model generates incorrect information. Retrieval-augmented models attempt to gather and utilize relevant data from the web. However, the quality and relevance of the retrieved information can vary significantly, limiting the overall effectiveness of these models.

To address these challenges, researchers from Tel Aviv University, the University of Pennsylvania, the Allen Institute for AI, the University of Washington, and Princeton University have introduced ASSISTANTBENCH, a new benchmark for evaluating how well web agents perform realistic, time-consuming web tasks. The benchmark consists of 214 diverse tasks that span various domains and require web-based interaction. The researchers also propose SEEPLANACT (SPA), a novel web agent designed to improve task performance by incorporating a planning component and a memory buffer.

SPA builds upon the existing SEEACT model, introducing several improvements to enhance web navigation and task execution. The planning component enables SPA to strategize its approach to each task, allowing it to re-plan and adjust its strategy dynamically based on interactions with web elements. The memory buffer retains information gathered during the task, enabling SPA to utilize this information effectively throughout the task's duration. These enhancements allow SPA to interact more robustly with web elements, navigate dynamically, and adjust its plan as needed, providing a more effective solution for handling complex web tasks.
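The interplay of planning and memory described above can be illustrated with a minimal sketch. This is not the SPA implementation: the names `AgentState`, `run_agent`, `toy_policy`, and `toy_act`, the `ANSWER:` convention, and the toy "pages" are all hypothetical, and a hand-written rule stands in for the model that would choose actions in a real agent. The sketch only shows the control flow in which each step may revise the plan, store a note in memory, and then either act or answer.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple


@dataclass
class AgentState:
    plan: str = ""                                    # current plan, revised every step
    memory: List[str] = field(default_factory=list)  # notes kept across steps


# Hypothetical policy signature: (task, observation, state) -> (action, new_plan, note)
Policy = Callable[[str, str, AgentState], Tuple[str, str, str]]


def run_agent(task: str, policy: Policy, act: Callable[[str], str],
              first_observation: str, max_steps: int = 10
              ) -> Tuple[Optional[str], AgentState]:
    """Loop: observe, re-plan, remember, then act or answer."""
    state = AgentState()
    observation = first_observation
    for _ in range(max_steps):
        action, new_plan, note = policy(task, observation, state)
        state.plan = new_plan            # re-planning: the plan may change each step
        if note:
            state.memory.append(note)    # memory buffer: keep what was learned
        if action.startswith("ANSWER:"):
            return action[len("ANSWER:"):].strip(), state
        observation = act(action)        # execute the action in the environment
    return None, state                   # step budget exhausted: abstain


# Toy environment (assumption): two "pages", each showing one price.
PAGES = {"open gym_a": "Gym A: $30", "open gym_b": "Gym B: $25"}


def toy_act(action: str) -> str:
    return PAGES.get(action, "not found")


def price(line: str) -> int:
    return int(line.split("$")[1])


def toy_policy(task: str, observation: str, state: AgentState) -> Tuple[str, str, str]:
    # Stand-in for a model call: record useful observations, then decide.
    note = observation if observation.startswith("Gym") else ""
    known = state.memory + ([note] if note else [])
    if len(known) == 0:
        return "open gym_a", "visit gym_a, then gym_b, then compare", note
    if len(known) == 1:
        return "open gym_b", "visit gym_b, then compare", note
    return "ANSWER: " + min(known, key=price), "done", note


answer, final_state = run_agent("Find the cheaper gym", toy_policy, toy_act, "start")
print(answer)
print(final_state.memory)
```

In this toy run the final comparison is only possible because the first price is still in memory when the second page is read, which is the role the article attributes to the memory buffer.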

Performance evaluations of SPA on the ASSISTANTBENCH benchmark showed significant improvements over previous models. SPA achieved an accuracy score of 11 points, a substantial increase compared to the 4.2 points achieved by the earlier SEEACT model. Moreover, SPA demonstrated higher precision, with a 10-point increase in the number of correctly answered questions. This improvement was primarily due to SPA's enhanced ability to navigate web environments and utilize gathered information effectively. Despite these advancements, the overall accuracy of the best-performing models remained at roughly 25 points, highlighting the continued challenges in developing highly reliable web-based AI solutions.

In more detailed performance metrics, SPA's integration of planning and memory components allowed it to outperform other models in terms of answer rate and precision. SPA's answer rate was 38.8%, compared to the 20% achieved by the earlier SEEACT model. SPA's precision was also higher, at 29.0%, compared to 19.6% for SEEACT. An ensemble that combines SPA with a closed-book model achieved the best overall performance, with an accuracy of 25.2 points, further emphasizing SPA's effectiveness in improving task performance.
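To make the relationship between these three metrics concrete, here is a small sketch under stated assumptions. The article does not define the metrics, so the definitions below are assumptions: answer rate is the share of tasks the agent answers, precision is the mean score over answered tasks only, and accuracy is the mean score over all tasks with abstentions counted as zero. The function name `summarize` and the toy scores are hypothetical.

```python
from typing import Dict, List, Optional


def summarize(scores: List[Optional[float]]) -> Dict[str, float]:
    """scores[i] is a per-task score in [0, 1], or None if the agent abstained."""
    answered = [s for s in scores if s is not None]
    answer_rate = len(answered) / len(scores)
    # Precision (assumed definition): mean score over answered tasks only.
    precision = sum(answered) / len(answered) if answered else 0.0
    # Accuracy (assumed definition): mean over all tasks, abstentions score zero.
    accuracy = sum(answered) / len(scores)
    return {"answer_rate": answer_rate, "precision": precision, "accuracy": accuracy}


# Toy data: five tasks, two abstentions, partial credit on one answer.
toy_scores = [1.0, 0.5, None, None, 0.0]
metrics = summarize(toy_scores)
print(metrics)

# Under these definitions, accuracy equals answer_rate * precision.
assert abs(metrics["accuracy"] - metrics["answer_rate"] * metrics["precision"]) < 1e-12
```

Under the same assumed definitions, the figures quoted above multiply out to about 11.3 for SPA (0.388 × 0.290) and about 3.9 for SEEACT (0.20 × 0.196), close to but not exactly the reported 11 and 4.2 points, so the identity should be read as approximate for the published numbers.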

To conclude, this research underscores the critical challenges in developing AI systems capable of performing realistic, time-consuming web tasks. The introduction of ASSISTANTBENCH and SPA represents a significant step forward in addressing these challenges. However, a considerable gap remains before AI systems can navigate the web reliably and with high precision. The results are promising, but they also show that further research is needed to close that gap.

Check out the Paper and Project. All credit for this research goes to the researchers of this project.


The post This AI Paper Introduces AssistantBench and SeePlanAct: A Benchmark and Agent for Complex Web-Based Tasks appeared first on MarkTechPost.
