ScreenSpot-Pro: The First Benchmark Driving Multi-Modal LLMs into High-Resolution Professional GUI-Agent and Computer-Use Environments

GUI agents face three critical challenges in professional environments: (1) the greater complexity of professional applications compared to general-use software, requiring detailed comprehension of intricate layouts; (2) the higher resolution of professional tools, resulting in smaller target sizes and reduced grounding accuracy; and (3) the reliance on additional tools and documents, adding complexity to workflows. These challenges highlight the need for advanced benchmarks and solutions to enhance GUI agent performance in these demanding scenarios.

Current GUI grounding models and benchmarks are insufficient to fulfill professional environment requirements. Tools like ScreenSpot are designed for low-resolution tasks and lack the variety to simulate real-world scenarios accurately. Models such as OS-Atlas and UGround are computationally inefficient and fail when the targets are small or the interface is icon-rich, which is common in professional applications. In addition, the absence of multilingual support reduces their applicability in global workflows. These shortcomings highlight the need for more comprehensive and realistic benchmarks to further this field.

A team of researchers from the National University of Singapore, East China Normal University, and Hong Kong Baptist University introduce ScreenSpot-Pro: a new framework that is tailored to professional high-resolution environments. This benchmark has a dataset of 1,581 tasks across 23 applications in industries such as development, creative tools, CAD, scientific platforms, and office suites. It incorporates high-resolution, full-screen visuals and expert annotations that ensure accuracy and realism. Multilingual guidelines encompass both English and Chinese for an expanded range of evaluation. ScreenSpot-Pro is unique as it documents the actual workflows that result in real, high-quality annotations, therefore serving as a tool for the full assessment and development of GUI grounding models.

The dataset ScreenSpot-Pro captures realistic and challenging scenarios. The base of this dataset is formed by high-resolution images, where the target regions form an average of only 0.07% of the total screen, thus pointing to subtle and small GUI elements. Data was collected by professional users with experience in relevant applications, who used specialized tools to ensure accurate annotations. Additionally, the dataset supports multilingual capabilities to test bilingual functionality and contains several workflows to capture the subtleties of real professional tasks. These characteristics render it particularly advantageous for the assessment and enhancement of the accuracy and flexibility of GUI agents.

The analysis of current GUI grounding models utilizing ScreenSpot-Pro reveals considerable deficiencies in their capacity to manage high-resolution professional settings. OS-Atlas-7B attained the greatest accuracy rate of 18.9%. However, iterative methodologies, exemplified by ReGround, demonstrated the capacity to enhance performance, reaching an accuracy of 40.2% by fine-tuning predictions through a multi-step methodology. Minor components, such as icons, presented significant difficulties, whereas bilingual assignments further highlighted the limitations of the models. These findings emphasize the necessity for improved techniques that bolster contextual comprehension and resilience in intricate GUI situations.

ScreenSpot-Pro sets a transformative benchmark for the evaluation of GUI agents in professional high-resolution environments. It addresses the specific challenges in complex workflows, offering a diverse and precise dataset to guide innovations in GUI grounding. This contribution forms the foundation of much smarter and more efficient agents that support a seamless performance of professional tasks, significantly boosting productivity and innovation in all industry fields.

Check out the Paper and Data. All credit for this research goes to the researchers of this project. Also, don’t forget to follow us on Twitter and join our Telegram Channel and LinkedIn Group. Don’t Forget to join our 60k+ ML SubReddit.

FREE UPCOMING AI WEBINAR (JAN 15, 2025): Boost LLM Accuracy with Synthetic Data and Evaluation Intelligence–Join this webinar to gain actionable insights into boosting LLM model performance and accuracy while safeguarding data privacy.

The post ScreenSpot-Pro: The First Benchmark Driving Multi-Modal LLMs into High-Resolution Professional GUI-Agent and Computer-Use Environments appeared first on MarkTechPost.

Source: Read MoreÂ

Sunshine And March Vibes (2025 Wallpapers Edition)

The Case For Minimal WordPress Setups: A Contrarian View On Theme Frameworks

How To Fix Largest Contentful Paint Issues With Subpart Analysis

How To Prevent WordPress SQL Injection Attacks

SteelSeries reveals new Arctis Nova 3 Wireless headset series for Xbox, PlayStation, Nintendo Switch, and PC

The Witcher 4 looks absolutely amazing in UE5 technical presentation at State of Unreal 2025

Razer’s having another go at making it so you never have to charge your wireless gaming mouse, and this time it might have nailed it

Alienware’s rumored laptop could be the first to feature NVIDIA’s revolutionary Arm-based APU

easy-live2d – About Make your Live2D as easy to control as a pixi sprite! Live2D Web SDK based on Pixi.js.

easy-live2d – About Make your Live2D as easy to control as a pixi sprite! Live2D Web SDK based on Pixi.js.

From Kitchen To Conversion

Perficient Included in Forrester’s AI Technical Services Landscape, Q2 2025

SteelSeries reveals new Arctis Nova 3 Wireless headset series for Xbox, PlayStation, Nintendo Switch, and PC

SteelSeries reveals new Arctis Nova 3 Wireless headset series for Xbox, PlayStation, Nintendo Switch, and PC

The Witcher 4 looks absolutely amazing in UE5 technical presentation at State of Unreal 2025

Razer’s having another go at making it so you never have to charge your wireless gaming mouse, and this time it might have nailed it

ScreenSpot-Pro: The First Benchmark Driving Multi-Modal LLMs into High-Resolution Professional GUI-Agent and Computer-Use Environments

Actively Exploited Qualcomm GPU Zero-Days Added to CISA’s KEV Catalog

HPE Issues Security Patch for StoreOnce Bug Allowing Remote Authentication Bypass

Microsoft releases updated HLK and VHLK for Windows 11 24H2 & Server 2025

Ensuring Success: The Role of QA in Dynamics 365 Implementation

FreeAskInternet: A Free, Private, and Locally Running Search Aggregator and Answer Generate Using Multi LLMs without GPU Needed

CVE-2024-56156 – Halo File Type Validation Bypass Vulnerability

Researchers at Cambridge Provide Empirical Insights into Deep Learning through the Pedagogical Lens of Telescopic Model that Uses First-Order Approximations

Top 9 iOS Mobile Testing Tools: Comprehensive Guide for 2024

TypeScript: The problem with function overloads

Elegant Long Press Implementation Using CSS Animation

ScreenSpot-Pro: The First Benchmark Driving Multi-Modal LLMs into High-Resolution Professional GUI-Agent and Computer-Use Environments

Related Posts