    This AI Paper Explores the Extent to which LLMs can Self-Improve their Performance as Agents in Long-Horizon Tasks in a Complex Environment Using the WebArena Benchmark

    June 3, 2024

Large language models (LLMs) have shown their potential in many natural language processing (NLP) tasks, such as summarization and question answering, using zero-shot and few-shot prompting approaches. However, prompting alone is not enough to make LLMs work as agents that can navigate environments to solve complex, multi-step tasks. Fine-tuning LLMs for these tasks is also impractical because training data is unavailable: collecting data for tasks that require decision-making and complex interactions is both time-consuming and costly. In addition, automatically evaluating the sequences of actions an agent takes is challenging, and poor metrics make it hard to tell whether the agent's performance is improving or degrading.
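To make the contrast concrete, here is a minimal sketch of the two prompting styles the paragraph mentions. The prompt templates and the toy examples are illustrative assumptions, not taken from the paper:

```python
# Hedged sketch of zero-shot vs. few-shot prompting.
# Templates and examples are illustrative assumptions.

def zero_shot_prompt(question):
    # Zero-shot: the model receives only the task, with no worked examples.
    return f"Answer the question.\nQ: {question}\nA:"

def few_shot_prompt(question, examples):
    # Few-shot: a handful of (question, answer) demonstrations precede the task.
    shots = "\n".join(f"Q: {q}\nA: {a}" for q, a in examples)
    return f"{shots}\nQ: {question}\nA:"

examples = [("2 + 2?", "4"), ("3 + 5?", "8")]
prompt = few_shot_prompt("7 + 1?", examples)
print(prompt.count("Q:"))  # 3: two demonstrations plus the new question
```

Either prompt would then be sent to an LLM; the point is that neither approach updates the model's weights, which is why prompting alone falls short for long-horizon agent tasks.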

Various methods have been proposed for self-improving LLMs, including self-distillation, in which the teacher and the student are the same model. An LLM agent's performance can also be improved with prompting techniques, but these are orthogonal to self-improvement fine-tuning. Further, work on self-improving agents shows that complex robotics tasks can be solved by agents that learn and improve on their own. One method discussed in the paper filters trajectories and fine-tunes on them; however, it relies on supervised filtering and does not explore generating novel tasks or synthetic data.

Researchers from the University of Pennsylvania, ExtensityAI, Johannes Kepler University Linz, and NXAI introduced new techniques that allow LLM agents to solve complex, multi-step tasks through self-improvement. All of the techniques involve fine-tuning the LLM agent, with the training signal derived through unsupervised methods such as self-critique, which filters the training examples. To understand the impact of self-improvement in detail, two auxiliary metrics are introduced: (a) a measure of the capabilities gained and lost by the agent, and (b) an extension of the VERTEX score that measures the quality of agent trajectories of different lengths.
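The core loop of unsupervised filtering can be sketched as follows. The `self_critique` prompt, the `critic` callable, and the toy data are all hypothetical stand-ins for the paper's actual method, included only to show the shape of label-free trajectory filtering:

```python
# Hypothetical sketch of unsupervised trajectory filtering via self-critique.
# The critique prompt, critic interface, and success heuristic are assumptions.

def self_critique(trajectory, critic):
    """Ask a model to judge whether a trajectory completed its task."""
    prompt = (
        "Task: " + trajectory["task"] + "\n"
        "Actions taken: " + " -> ".join(trajectory["actions"]) + "\n"
        "Did these actions complete the task? Answer yes or no."
    )
    return critic(prompt).strip().lower().startswith("yes")

def build_finetuning_set(trajectories, critic):
    """Keep only trajectories the critic judges successful; no human labels."""
    return [t for t in trajectories if self_critique(t, critic)]

# Toy critic that approves trajectories ending in an explicit stop action.
toy_critic = lambda prompt: "yes" if "stop" in prompt else "no"

data = [
    {"task": "find price", "actions": ["click", "read", "stop"]},
    {"task": "log in", "actions": ["click", "type"]},
]
kept = build_finetuning_set(data, toy_critic)
print(len(kept))  # 1: only the completed trajectory survives filtering
```

In the real setting the critic would be the LLM itself, making the filter unsupervised, and the surviving trajectories would form the fine-tuning mixture.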

These metrics capture small changes, both improvements and regressions, that overall benchmark scores miss. The researchers then ran multiple experiments, fine-tuning agent models on mixtures of synthetic training data and measuring each fine-tuned agent's self-improvement over the base agent with the evaluation metrics. The performance of the baseline agent is measured, and a trivial agent that always outputs stop [N/A] is also implemented. This trivial baseline helps identify the tasks an agent completes through genuine capability, which is what the capability score is computed from.
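One plausible way to use such a trivial baseline is to credit an agent only for tasks the trivial agent does not already pass. The function names and scoring details below are illustrative assumptions, not the paper's exact capability-score definition:

```python
# Hypothetical sketch of a capability-style score: credit the agent only for
# tasks that a trivial agent (always answering stop [N/A]) does not pass.
# Names and the scoring formula are illustrative assumptions.

TRIVIAL_ANSWER = "stop [N/A]"

def trivial_agent(task):
    # The trivial baseline ignores the task entirely.
    return TRIVIAL_ANSWER

def capability_score(agent_results, trivial_results):
    """Fraction of non-trivially-solvable tasks that the agent solves."""
    earned = [t for t, ok in agent_results.items()
              if ok and not trivial_results.get(t, False)]
    contested = [t for t, ok in trivial_results.items() if not ok]
    return len(earned) / len(contested) if contested else 0.0

# task2 happens to be graded correct even for the trivial answer.
trivial = {"task1": False, "task2": True, "task3": False}
agent   = {"task1": True,  "task2": True, "task3": False}
print(capability_score(agent, trivial))  # 0.5
```

The point of the baseline is exactly this separation: task2 contributes nothing to the score because even a do-nothing agent passes it.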

The experiments show that models can self-improve at web agent tasks and raise overall benchmark performance: the best-performing mixture solves 18 tasks correctly, a relative improvement of 31%. The results also show that self-improved agents gain new capabilities while losing the ability to perform a few existing ones. Here, two mixtures fine-tuned to raise the capability score exhibit 5 more capabilities than the base agent model, a relative improvement of 24%.

In conclusion, the researchers introduced new techniques that allow LLM agents to solve complex, multi-step tasks through self-improvement. Self-improvement enhances the performance of agent models and lets them gain new capabilities, at the cost of only a minimal decrease in trajectory quality. The fine-tuning techniques do have limitations, however: they improve performance by reinforcing the underlying model's correct actions and decisions, but they can just as easily reinforce its incorrect actions and biases. This limitation can be mitigated with human or supervised filtering.

Check out the Paper. All credit for this research goes to the researchers of this project.

    The post This AI Paper Explores the Extent to which LLMs can Self-Improve their Performance as Agents in Long-Horizon Tasks in a Complex Environment Using the WebArena Benchmark appeared first on MarkTechPost.
