
    Microsoft AI Releases OmniParser V2: An AI Tool that Turns Any LLM into a Computer Use Agent

    February 19, 2025

    Enabling Large Language Models (LLMs) to navigate and interact with graphical user interfaces (GUIs) remains a notable challenge. LLMs are adept at processing text, but they often struggle to interpret visual elements such as icons, buttons, and menus. This limits their effectiveness in tasks that require interacting with software interfaces, which are predominantly visual.

    To address this issue, Microsoft has introduced OmniParser V2, a tool designed to enhance the GUI comprehension capabilities of LLMs. OmniParser V2 converts UI screenshots into structured, machine-readable data, enabling LLMs to understand and interact with various software interfaces more effectively. This development aims to bridge the gap between textual and visual data processing, facilitating more comprehensive AI applications.

    OmniParser V2 operates through two main components: detection and captioning. The detection module employs a fine-tuned version of the YOLOv8 model to identify interactive elements within a screenshot, such as buttons and icons. Simultaneously, the captioning module uses a fine-tuned Florence-2 base model to generate descriptive labels for these elements, providing context about their functions within the interface. This combined approach allows LLMs to construct a detailed understanding of the GUI, which is essential for accurate interaction and task execution.
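The two-stage flow described above can be sketched in a few lines of Python. This is a minimal illustration, not OmniParser's actual API: `UIElement`, `parse_screenshot`, `fake_detect`, and `fake_caption` are hypothetical names, the two stand-in functions replace the fine-tuned YOLOv8 and Florence-2 models with hard-coded results, and the sketch runs detection and captioning one after the other for simplicity.

```python
import json
from dataclasses import dataclass, asdict
from typing import Callable, List, Tuple

# (x_min, y_min, x_max, y_max) in screenshot pixels
Box = Tuple[int, int, int, int]


@dataclass
class UIElement:
    """One interactive element: where it is and what it appears to do."""
    element_id: int
    bbox: Box
    caption: str


def parse_screenshot(
    screenshot: object,
    detect: Callable[[object], List[Box]],
    caption: Callable[[object, Box], str],
) -> List[UIElement]:
    """Run detection, then caption every detected box."""
    elements = []
    for i, box in enumerate(detect(screenshot)):
        elements.append(UIElement(i, box, caption(screenshot, box)))
    return elements


# Hypothetical stand-ins for the detection and captioning models.
def fake_detect(screenshot: object) -> List[Box]:
    return [(10, 10, 90, 40), (100, 10, 130, 40)]


def fake_caption(screenshot: object, box: Box) -> str:
    labels = {
        (10, 10, 90, 40): "Save button",
        (100, 10, 130, 40): "Settings gear icon",
    }
    return labels[box]


elements = parse_screenshot(None, fake_detect, fake_caption)

# Structured, machine-readable output that could be handed to an LLM.
parsed_json = json.dumps([asdict(e) for e in elements], indent=2)
print(parsed_json)
```

Passing the detector and captioner in as plain callables keeps the sketch independent of any particular model library; real models would be wrapped behind the same two signatures.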

    A significant improvement in OmniParser V2 is the enhancement of its training datasets. The tool has been trained on a more extensive and refined set of icon captioning and grounding data, sourced from widely used web pages and applications. This enriched dataset enhances the model’s accuracy in detecting and describing smaller interactive elements, which are crucial for effective GUI interaction. Additionally, by optimizing the image size processed by the icon caption model, OmniParser V2 achieves a 60% reduction in latency compared to its previous version, with an average processing time of 0.6 seconds per frame on an A100 GPU and 0.8 seconds on a single RTX 4090 GPU.
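To put the per-frame latencies in context, the short calculation below converts them into throughput and into total parsing time for a multi-step task. Only the 0.6 s and 0.8 s figures come from the text above; the 20-step task length is a hypothetical number chosen for illustration, and the calculation ignores LLM inference and any other overhead.

```python
# Average per-frame latency in seconds, as quoted in the article.
LATENCY_S = {"A100": 0.6, "RTX 4090": 0.8}

# Hypothetical task length, used only for illustration.
HYPOTHETICAL_STEPS = 20


def frames_per_second(latency_s: float) -> float:
    """Throughput if screenshots are parsed back to back."""
    return 1.0 / latency_s


def parsing_time(steps: int, latency_s: float) -> float:
    """Total seconds spent parsing when each step needs one screenshot."""
    return steps * latency_s


for gpu, latency in LATENCY_S.items():
    fps = frames_per_second(latency)
    total = parsing_time(HYPOTHETICAL_STEPS, latency)
    print(f"{gpu}: {fps:.2f} frames/s, "
          f"{total:.1f} s of parsing over {HYPOTHETICAL_STEPS} steps")
```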

    The effectiveness of OmniParser V2 is demonstrated through its performance on the ScreenSpot Pro benchmark, an evaluation framework for GUI grounding capabilities. When combined with GPT-4o, OmniParser V2 achieved an average accuracy of 39.6%, up from GPT-4o’s baseline score of 0.8%, a gain of 38.8 percentage points. This improvement highlights the tool’s ability to help LLMs accurately interpret and interact with complex GUIs, even those with high-resolution displays and small target icons.

    To support integration and experimentation, Microsoft has developed OmniTool, a dockerized Windows system that incorporates OmniParser V2 along with essential tools for agent development. OmniTool is compatible with various state-of-the-art LLMs, including OpenAI’s 4o/o1/o3-mini, DeepSeek’s R1, Qwen’s 2.5VL, and Anthropic’s Sonnet. This flexibility allows developers to utilize OmniParser V2 across different models and applications, simplifying the creation of vision-based GUI agents.
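Because the parsed screen is plain structured data, the same agent step can in principle be driven by any of these models. The sketch below shows one way that might look, under stated assumptions: the element dictionaries, the `CLICK <id>` reply format, and the functions `build_prompt`, `choose_click`, and `fake_llm` are all hypothetical and are not taken from OmniTool. The model is passed in as a callable from prompt to reply, so swapping LLMs means swapping that one function.

```python
import re
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical element shape: {"id": int, "bbox": (x0, y0, x1, y1), "caption": str}
Element = Dict[str, object]


def build_prompt(task: str, elements: List[Element]) -> str:
    """Describe the screen as text so a text-only model can pick a target."""
    lines = [f"Task: {task}", "Interactive elements on screen:"]
    for el in elements:
        lines.append(f"[{el['id']}] {el['caption']}")
    lines.append("Reply with 'CLICK <id>' for the element to click.")
    return "\n".join(lines)


def choose_click(
    task: str,
    elements: List[Element],
    llm: Callable[[str], str],
) -> Optional[Tuple[int, int]]:
    """Ask the model for an element ID and return that element's center point."""
    reply = llm(build_prompt(task, elements))
    match = re.search(r"CLICK\s+(\d+)", reply)
    if match is None:
        return None  # the reply did not follow the expected format
    chosen = int(match.group(1))
    for el in elements:
        if el["id"] == chosen:
            x0, y0, x1, y1 = el["bbox"]
            return ((x0 + x1) // 2, (y0 + y1) // 2)
    return None  # the model named an ID that is not on screen


def fake_llm(prompt: str) -> str:
    """Stand-in for a real model call; always picks element 1."""
    return "CLICK 1"


elements = [
    {"id": 0, "bbox": (10, 10, 90, 40), "caption": "Save button"},
    {"id": 1, "bbox": (100, 10, 130, 40), "caption": "Settings gear icon"},
]
point = choose_click("Open the settings", elements, fake_llm)
print(point)
```

Returning `None` for malformed replies or unknown IDs lets the caller decide whether to retry, re-parse the screen, or stop, rather than clicking at a guessed location.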

    In summary, OmniParser V2 is a meaningful step toward integrating LLMs with graphical user interfaces. By converting UI screenshots into structured data, it lets LLMs comprehend and interact with software interfaces more effectively. The improvements in detection accuracy, latency, and benchmark performance make it a useful tool for developers building agents that navigate and manipulate GUIs autonomously.



    The post Microsoft AI Releases OmniParser V2: An AI Tool that Turns Any LLM into a Computer Use Agent appeared first on MarkTechPost.
