
    Salesforce AI Research Introduces AGUVIS: A Unified Pure Vision Framework Transforming Autonomous GUI Interaction Across Platforms

    December 24, 2024

Graphical User Interfaces (GUIs) play a fundamental role in human-computer interaction, providing the medium through which users accomplish tasks across web, desktop, and mobile platforms. Automation in this field could be transformative, drastically improving productivity and enabling seamless task execution without manual intervention. Autonomous agents capable of understanding and interacting with GUIs could revolutionize workflows, particularly in repetitive or complex task settings. However, the inherent complexity and variability of GUIs across platforms pose significant challenges. Each platform uses distinct visual layouts, action spaces, and interaction logic, making it difficult to build scalable and robust solutions. Developing systems that can navigate these environments autonomously while generalizing across platforms remains an open challenge for researchers in this domain.

GUI automation currently faces many technical hurdles; a central one is aligning natural language instructions with the diverse visual representations of GUIs. Traditional methods often rely on textual representations, such as HTML or accessibility trees, to model GUI elements. These approaches are limited because GUIs are inherently visual, and textual abstractions fail to capture the nuances of visual design. In addition, textual representations vary between platforms, leading to fragmented data and inconsistent performance. This mismatch between the visual nature of GUIs and the textual inputs used in automation systems results in reduced scalability, longer inference times, and limited generalization. Most current methods are also incapable of effective multimodal reasoning and grounding, both of which are essential for understanding complex visual environments.

    Existing tools and techniques have attempted to address these challenges with mixed success. Many systems depend on closed-source models to enhance reasoning and planning capabilities. These models often use natural language communication to combine grounding and reasoning processes, but this approach introduces information loss and lacks scalability. Another common limitation is the fragmented nature of training datasets, which fail to provide comprehensive support for grounding and reasoning tasks. For instance, datasets typically emphasize either grounding or reasoning, but not both, leading to models that excel in one area while struggling in others. This division hampers the development of unified solutions for autonomous GUI interaction.

Researchers from the University of Hong Kong and Salesforce Research introduced AGUVIS (7B and 72B), a unified framework designed to overcome these limitations by relying on purely vision-based observations. AGUVIS eliminates the reliance on textual representations and instead operates on image inputs, aligning the model's structure with the visual nature of GUIs. The framework defines a consistent action space across platforms, facilitating cross-platform generalization, and integrates explicit planning and multimodal reasoning to navigate complex digital environments. The researchers constructed a large-scale dataset of GUI agent trajectories, which was used to train AGUVIS in a two-stage process. The framework's modular architecture, which includes a pluggable action system, allows for seamless adaptation to new environments and tasks.
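    The paper describes this interface only in prose; the Python sketch below, with entirely hypothetical names (VisionGUIAgent, take_screenshot, execute), merely illustrates the screenshot-in, action-out loop that a pure vision framework of this kind implies.

    ```python
    # Hypothetical sketch of a pure vision-based GUI agent loop: the model sees
    # only a screenshot plus the textual goal -- no HTML or accessibility tree.
    from dataclasses import dataclass, field

    @dataclass
    class Action:
        kind: str                                  # e.g. "click", "type", "swipe", or "stop"
        args: dict = field(default_factory=dict)   # e.g. {"x": 0.42, "y": 0.17}, normalized coords

    class VisionGUIAgent:
        def __init__(self, model):
            self.model = model  # vision-language model: (image, goal, history) -> Action

        def step(self, screenshot_png: bytes, goal: str, history: list) -> Action:
            return self.model.predict(image=screenshot_png, goal=goal, history=history)

    def run_episode(agent: VisionGUIAgent, env, goal: str, max_steps: int = 20):
        history = []
        for _ in range(max_steps):
            shot = env.take_screenshot()       # platform-specific screen capture
            action = agent.step(shot, goal, history)
            if action.kind == "stop":          # model signals task completion
                break
            env.execute(action)                # dispatched through a unified action space
            history.append(action)
        return history
    ```

    Because the model consumes only pixels and a goal string, the same loop can in principle drive web, desktop, or mobile environments; only the capture and execution backends change.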

    The AGUVIS framework employs a two-stage training paradigm to equip the model with grounding and reasoning capabilities: 

    1. During the first stage, the model focuses on grounding: mapping natural language instructions to visual elements within GUI environments. This stage uses a grounding packing strategy, bundling multiple instruction-action pairs that share a single GUI screenshot into one training sample (a sketch of this packing follows the list). The method improves training efficiency by maximizing the utility of each image without sacrificing accuracy. 
    2. The second stage introduces planning and reasoning, training the model to execute multi-step tasks across various platforms and scenarios. This stage incorporates detailed inner monologues, which include observation descriptions, thoughts, and low-level action instructions. By progressively increasing the complexity of training data, the model learns to handle nuanced tasks with precision and adaptability.
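    The paper gives only a prose description of these two stages; as a purely illustrative sketch (hypothetical field names, not the released data pipeline), the snippet below shows how a packed stage-1 grounding sample and a stage-2 planning sample with an inner monologue might be laid out.

    ```python
    # Illustrative data layouts for the two training stages. Field names are
    # hypothetical; they only mirror the prose description above.

    def pack_grounding_sample(screenshot_path, pairs):
        """Stage 1: several (instruction, action) pairs that share one screenshot
        are folded into a single multi-turn sample, so the image is encoded once."""
        turns = []
        for instruction, action in pairs:
            turns.append({"role": "user", "content": instruction})
            turns.append({"role": "assistant", "content": action})
        return {"image": screenshot_path, "conversations": turns}

    def planning_step_sample(screenshot_path, goal, observation, thought, action):
        """Stage 2: each step pairs the screenshot with an explicit inner monologue
        (observation description + thought) before the low-level action."""
        return {
            "image": screenshot_path,
            "goal": goal,
            "observation": observation,   # what the agent sees on screen
            "thought": thought,           # why the next action makes sense
            "action": action,             # low-level command, e.g. a pyautogui call
        }

    packed = pack_grounding_sample(
        "login_page.png",
        [
            ("Click the 'Sign in' button", "pyautogui.click(x=0.87, y=0.12)"),
            ("Focus the email field",      "pyautogui.click(x=0.50, y=0.34)"),
        ],
    )
    ```

    Packing amortizes the cost of encoding a high-resolution screenshot over several supervision signals, which is where the training-efficiency gain in stage one comes from.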

    AGUVIS demonstrated strong results in both offline and real-world online evaluations. In GUI grounding, the model achieved an average accuracy of 89.2%, surpassing state-of-the-art methods across mobile, desktop, and web platforms. In offline planning tasks, AGUVIS outperformed competing models with a 51.9% improvement in step success rate. The model also achieved a 93% reduction in inference costs compared to GPT-4o. By focusing on visual observations and integrating a unified action space, AGUVIS sets a new benchmark for GUI automation, making it the first fully autonomous pure vision-based agent capable of completing real-world tasks without reliance on closed-source models.

    Key takeaways from the research on AGUVIS in the field of GUI automation:

    1. AGUVIS uses image-based inputs, reducing token costs significantly and aligning the model with the inherently visual nature of GUIs. This approach results in a token cost of only 1,200 for 720p image observations, compared to 6,000 for accessibility trees and 4,000 for HTML-based observations.
    2. The model combines grounding and planning stages, enabling it to perform single- and multi-step tasks effectively. The grounding training alone equips the model to process multiple instructions within a single image, while the reasoning stage enhances its ability to execute complex workflows.
    3. The AGUVIS Collection unifies and augments existing datasets with synthetic data to support multimodal reasoning and grounding. This results in a diverse and scalable dataset, enabling the training of robust and adaptable models.
    4. Using pyautogui commands together with a pluggable action system allows the model to generalize across platforms while accommodating platform-specific actions, such as swiping on mobile devices (see the sketch after this list).
    5. AGUVIS achieved strong results on GUI grounding benchmarks, with accuracy rates of 88.3% on web platforms, 85.7% on mobile, and 81.8% on desktop. It also demonstrated superior efficiency, reducing inference costs (in USD) by 93% compared to existing models.
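    The takeaways above mention pyautogui commands plus a pluggable action system. The registry below is a minimal sketch, under the assumption that actions are dispatched by name, of how a shared desktop core plus platform-specific plugins could be organized; it is illustrative only and not taken from the AGUVIS codebase.

    ```python
    import pyautogui  # shared desktop backend; mobile actions would use e.g. adb instead

    ACTIONS = {}

    def register(name):
        """Decorator that adds an action to the shared, pluggable action registry."""
        def wrap(fn):
            ACTIONS[name] = fn
            return fn
        return wrap

    @register("click")
    def click(x: float, y: float):
        # The model emits normalized coordinates; convert to pixels for this screen.
        w, h = pyautogui.size()
        pyautogui.click(int(x * w), int(y * h))

    @register("write")
    def write(text: str):
        pyautogui.write(text)

    # Platform plugin: "swipe" exists only on the mobile backend, so it is
    # registered separately and would delegate to e.g. `adb shell input swipe`.
    @register("swipe")
    def swipe(x1, y1, x2, y2, duration_ms=300):
        raise NotImplementedError("handled by the mobile backend, not pyautogui")

    def execute(command: str, **kwargs):
        ACTIONS[command](**kwargs)

    # Example: execute("click", x=0.5, y=0.2)
    ```

    Keeping coordinates normalized and dispatching by action name is one way a single model output format can drive different platforms while each backend supplies its own implementations.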

    In conclusion, the AGUVIS framework addresses critical challenges in grounding, reasoning, and generalization in GUI automation. Its purely vision-based approach eliminates the inefficiencies associated with textual representations, while its unified action space enables seamless interaction across diverse platforms. The research provides a robust solution for autonomous GUI tasks, with applications ranging from productivity tools to advanced AI systems.


    Check out the Paper, GitHub Page, and Project. All credit for this research goes to the researchers of this project.


    The post Salesforce AI Research Introduces AGUVIS: A Unified Pure Vision Framework Transforming Autonomous GUI Interaction Across Platforms appeared first on MarkTechPost.
