
    Microsoft AI Releases OmniParser V2: An AI Tool that Turns Any LLM into a Computer Use Agent

    February 19, 2025

    Enabling Large Language Models (LLMs) to navigate and interact with graphical user interfaces (GUIs) remains a notable challenge. LLMs are adept at processing text, but they often struggle to interpret visual elements such as icons, buttons, and menus. This limits their effectiveness in tasks that require interacting with software interfaces, which are predominantly visual.

    To address this issue, Microsoft has introduced OmniParser V2, a tool designed to enhance the GUI comprehension capabilities of LLMs. OmniParser V2 converts UI screenshots into structured, machine-readable data, enabling LLMs to understand and interact with various software interfaces more effectively. This development aims to bridge the gap between textual and visual data processing, facilitating more comprehensive AI applications.
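To make "structured, machine-readable data" concrete, here is a minimal sketch of what a parsed screen could look like once serialized. The `UIElement` schema, its field names, and the sample values are hypothetical illustrations, not OmniParser V2's actual output format.

```python
import json
from dataclasses import dataclass, asdict
from typing import List, Tuple


@dataclass
class UIElement:
    """One parsed screen element (hypothetical schema, not OmniParser's own)."""
    element_id: int
    kind: str                                 # e.g. "icon" or "text"
    bbox: Tuple[float, float, float, float]   # (x_min, y_min, x_max, y_max), normalized to [0, 1]
    caption: str                              # short description of what the element does
    interactable: bool


def to_structured_json(elements: List[UIElement]) -> str:
    """Serialize parsed elements into JSON text that an LLM can read."""
    return json.dumps([asdict(e) for e in elements], indent=2)


# Made-up sample values, for illustration only.
elements = [
    UIElement(0, "icon", (0.02, 0.01, 0.06, 0.05), "Back button", True),
    UIElement(1, "text", (0.10, 0.01, 0.60, 0.05), "Search settings", True),
]
print(to_structured_json(elements))
```

A text-only model can reason over this listing and refer to elements by `element_id` instead of guessing pixel coordinates from the raw image.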

    OmniParser V2 operates through two main components: detection and captioning. The detection module employs a fine-tuned version of the YOLOv8 model to identify interactive elements within a screenshot, such as buttons and icons. The captioning module then uses a fine-tuned Florence-2 base model to generate descriptive labels for these elements, providing context about their functions within the interface. Together, the bounding boxes and captions give an LLM a detailed description of the GUI, which is essential for accurate interaction and task execution.
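The two-component design can be sketched as a small pipeline in which a detector proposes boxes and a captioner labels each one. This is an illustration under stated assumptions: `parse_screenshot`, `fake_detect`, and `fake_caption` are hypothetical names, the stand-in functions return canned values in place of the fine-tuned YOLOv8 and Florence-2 models, and the sketch assumes captioning runs once per detected region.

```python
from typing import Callable, Dict, List, Tuple

Box = Tuple[int, int, int, int]  # (x_min, y_min, x_max, y_max) in pixels


def parse_screenshot(
    screenshot: object,
    detect: Callable[[object], List[Box]],
    caption: Callable[[object, Box], str],
) -> List[Dict]:
    """Run detection, then caption every detected region."""
    boxes = detect(screenshot)
    return [
        {"id": i, "bbox": box, "caption": caption(screenshot, box)}
        for i, box in enumerate(boxes)
    ]


# Stand-ins for the real models (assumptions, for illustration only).
def fake_detect(screenshot: object) -> List[Box]:
    return [(10, 10, 42, 42), (100, 10, 180, 42)]


def fake_caption(screenshot: object, box: Box) -> str:
    return "settings gear icon" if box[0] < 50 else "Save button"


result = parse_screenshot(None, fake_detect, fake_caption)
for element in result:
    print(element)
```

Passing the two stages in as callables keeps the sketch independent of any particular model library; real detector and captioner wrappers could be swapped in without changing `parse_screenshot`.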

    A significant improvement in OmniParser V2 is the enhancement of its training datasets. The tool has been trained on a more extensive and refined set of icon captioning and grounding data, sourced from widely used web pages and applications. This enriched dataset enhances the model’s accuracy in detecting and describing smaller interactive elements, which are crucial for effective GUI interaction. Additionally, by optimizing the image size processed by the icon caption model, OmniParser V2 achieves a 60% reduction in latency compared to its previous version, with an average processing time of 0.6 seconds per frame on an A100 GPU and 0.8 seconds on a single RTX 4090 GPU.

    The effectiveness of OmniParser V2 is demonstrated through its performance on the ScreenSpot Pro benchmark, an evaluation framework for GUI grounding capabilities. When combined with GPT-4o, OmniParser V2 achieved an average accuracy of 39.6%, a notable increase from GPT-4o’s baseline score of 0.8%. This improvement highlights the tool’s ability to enable LLMs to accurately interpret and interact with complex GUIs, even those with high-resolution displays and small target icons.
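As a quick check on the size of that jump, the two reported scores can be compared directly. The snippet uses only the figures quoted above; the variable names are illustrative.

```python
# Scores reported on ScreenSpot Pro (average accuracy, in percent).
baseline_gpt4o = 0.8
with_omniparser = 39.6

# Absolute gain in percentage points and relative improvement factor.
gain_points = with_omniparser - baseline_gpt4o
ratio = with_omniparser / baseline_gpt4o

print(f"Gain: {gain_points:.1f} percentage points")
print(f"Relative improvement: {ratio:.1f}x")
```

The absolute gain is 38.8 percentage points, or roughly a 49.5-fold improvement over the baseline.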

    To support integration and experimentation, Microsoft has developed OmniTool, a dockerized Windows system that incorporates OmniParser V2 along with essential tools for agent development. OmniTool is compatible with various state-of-the-art LLMs, including OpenAI’s 4o/o1/o3-mini, DeepSeek’s R1, Qwen’s 2.5VL, and Anthropic’s Sonnet. This flexibility allows developers to utilize OmniParser V2 across different models and applications, simplifying the creation of vision-based GUI agents.
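Because the parsed screen is plain text, the same agent step can in principle work with any of these models. The sketch below is not OmniTool's API: `build_prompt`, `choose_click`, the `CLICK <id>` reply convention, and the `fake_llm` stand-in are all hypothetical, and a real setup would replace `fake_llm` with a call to the chosen model.

```python
import re
from typing import Callable, Dict, List, Optional, Tuple


def build_prompt(task: str, elements: List[Dict]) -> str:
    """Describe the screen as numbered text lines any LLM can read."""
    lines = [f"Task: {task}", "Screen elements:"]
    for e in elements:
        lines.append(f"[{e['id']}] {e['caption']}")
    lines.append("Reply with: CLICK <id>")
    return "\n".join(lines)


def choose_click(
    task: str,
    elements: List[Dict],
    llm: Callable[[str], str],
) -> Optional[Tuple[float, float]]:
    """Ask the model for an element id and return that element's center point."""
    reply = llm(build_prompt(task, elements))
    match = re.search(r"CLICK\s+(\d+)", reply)
    if match is None:
        return None  # the reply did not follow the expected format
    by_id = {e["id"]: e for e in elements}
    element = by_id.get(int(match.group(1)))
    if element is None:
        return None  # the model named an id that is not on screen
    x_min, y_min, x_max, y_max = element["bbox"]
    return ((x_min + x_max) / 2, (y_min + y_max) / 2)


# Made-up parsed elements and a canned model reply, for illustration only.
elements = [
    {"id": 0, "bbox": (10, 10, 42, 42), "caption": "settings gear icon"},
    {"id": 1, "bbox": (100, 10, 180, 42), "caption": "Save button"},
]


def fake_llm(prompt: str) -> str:
    return "CLICK 1"


print(choose_click("Save the document", elements, fake_llm))  # (140.0, 26.0)
```

Only the `llm` callable depends on the model, so switching providers does not touch the prompt construction or the mapping from element id to click coordinates.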

    In summary, OmniParser V2 is a meaningful step toward integrating LLMs with graphical user interfaces. By converting UI screenshots into structured data, it lets LLMs comprehend and interact with software interfaces more effectively. Its improved detection of small elements, 60% lower latency, and ScreenSpot Pro results make it a useful tool for developers building agents that navigate and manipulate GUIs autonomously.


    Check out the Technical Details, Model on HF and GitHub Page. All credit for this research goes to the researchers of this project.

    The post Microsoft AI Releases OmniParser V2: An AI Tool that Turns Any LLM into a Computer Use Agent appeared first on MarkTechPost.
