
    GPT-4o Understands Text, But Does It See Clearly? A Benchmarking Study of MFMs on Vision Tasks

    July 24, 2025

    Multimodal foundation models (MFMs) such as GPT-4o, Gemini, and Claude have advanced rapidly, especially in public demos. While their language skills are well studied, their true ability to understand visual information remains unclear. Most benchmarks used today focus heavily on text-based tasks, such as visual question answering (VQA) or classification, which often reflect language strengths more than visual capabilities. These tests also require text outputs, making it difficult to fairly assess visual skills or to compare MFMs with vision-specific models. Moreover, critical aspects of visual understanding, such as 3D perception, segmentation, and grouping, are still largely overlooked in current evaluations.

    MFMs have demonstrated strong performance on tasks that combine visual and language understanding, such as captioning and visual question answering. However, their effectiveness on tasks that require detailed visual comprehension remains unclear. Because most current benchmarks rely on text-based outputs, it is difficult to compare MFMs fairly with vision-only models. Some studies adapt vision datasets for MFMs by converting annotations into text, but this conversion still restricts evaluation to language outputs. Prompting strategies have also been explored to help MFMs tackle visual tasks by breaking them into manageable subtasks, though reproducibility remains a challenge in some cases.

    Researchers at EPFL evaluated several popular multimodal foundation models, including GPT-4o, Gemini 2.0 Flash, and Claude 3.5 Sonnet, on core computer vision tasks such as segmentation, object detection, and depth prediction, using datasets like COCO and ImageNet. Since most MFMs are designed to output text and are only accessible via APIs, the team developed a prompt-chaining framework that translates these visual tasks into text-compatible formats. Their findings show that while MFMs are competent generalists, they fall short of specialized vision models, especially on geometric tasks. GPT-4o stood out, performing best in 4 of the 6 tasks. The evaluation toolkit will be open-sourced.
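    To make "text-compatible formats" concrete, here is a minimal sketch of how a vision task can be cast as a constrained text query. The query_mfm helper is a hypothetical stand-in for an MFM API call (e.g., to GPT-4o); it is not the authors' released toolkit, and the paper's actual prompts may differ.

```python
# Minimal sketch: image classification as a constrained text query.
# query_mfm is a hypothetical placeholder for an MFM API call, not
# part of the paper's toolkit.
from PIL import Image

def query_mfm(image: Image.Image, prompt: str) -> str:
    """Placeholder: send the image and prompt to a multimodal model API."""
    raise NotImplementedError("wire up your provider's client here")

def classify(image: Image.Image, labels: list[str]) -> str:
    """Constrain the model's free-form text reply to a fixed label set."""
    prompt = (
        "Classify this image. Reply with exactly one label from the "
        "following list and nothing else: " + ", ".join(labels)
    )
    answer = query_mfm(image, prompt).strip().lower()
    lookup = {label.lower(): label for label in labels}
    # Fall back to the first label if the reply is off-list.
    return lookup.get(answer, labels[0])
```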

    To evaluate MFMs on vision tasks, the study designed a prompt-chaining strategy that breaks complex tasks into simpler, language-friendly subtasks. For example, instead of predicting bounding boxes directly, the model first identifies which objects are present, then locates them through recursive image cropping (see the sketch below). For segmentation and grouping, images are divided into superpixels, which are easier to label and compare. Depth and surface normals are estimated from pairwise rankings of superpixel regions. This modular design plays to MFMs' strengths in classification and similarity judgment, while calibration controls ensure fair comparisons. The method is flexible, and performance improves with finer-grained prompting.
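    Below is a minimal sketch of the recursive-cropping idea for localization. It reuses the same hypothetical query_mfm placeholder as above; the framework's actual prompts, crop schedule, and stopping rule may differ.

```python
# Minimal sketch: localize an object by repeatedly asking which quadrant
# contains it and cropping into that quadrant. query_mfm is a hypothetical
# MFM API wrapper, not the authors' exact prompt chain.
from PIL import Image

def query_mfm(image: Image.Image, prompt: str) -> str:
    """Placeholder: send the image and prompt to a multimodal model API."""
    raise NotImplementedError

def locate(image: Image.Image, label: str, depth: int = 3) -> tuple:
    """Return an approximate bounding box (x0, y0, x1, y1) for `label`."""
    box = (0, 0, image.width, image.height)
    for _ in range(depth):
        x0, y0, x1, y1 = box
        mx, my = (x0 + x1) // 2, (y0 + y1) // 2
        quadrants = {
            "top-left": (x0, y0, mx, my),
            "top-right": (mx, y0, x1, my),
            "bottom-left": (x0, my, mx, y1),
            "bottom-right": (mx, my, x1, y1),
        }
        answer = query_mfm(
            image.crop(box),
            f"Which quadrant of this image contains the {label}? Reply with "
            "exactly one of: top-left, top-right, bottom-left, bottom-right.",
        ).strip().lower()
        if answer not in quadrants:
            break  # stop narrowing on an unparseable reply
        box = quadrants[answer]
    return box
```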

    The study evaluates several MFMs, including GPT-4o, Gemini 2.0 Flash, and Claude 3.5 Sonnet, across tasks such as image classification, object detection, and segmentation, using datasets like ImageNet, COCO, and Hypersim. GPT-4o reaches 77.2% classification accuracy on ImageNet and 60.62 AP50 for object detection, well below specialist models such as ViT-G (90.94% on ImageNet) and Co-DETR (91.30 AP50). On semantic segmentation, GPT-4o scores 44.89 mIoU, while OneFormer leads with 65.52. MFMs handle distribution shifts reasonably well but lag on precise visual reasoning. The study also introduces prompt-chaining and oracle baselines to estimate upper-bound performance.
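    For reference, the mIoU figure quoted above is the standard mean intersection-over-union over per-pixel label maps. A generic implementation of the metric (not the paper's evaluation code) looks like this:

```python
# Generic mean-IoU over per-pixel label maps; a standard implementation
# of the metric, not the paper's evaluation code.
import numpy as np

def mean_iou(pred: np.ndarray, gt: np.ndarray, num_classes: int) -> float:
    """Average IoU over classes present in prediction or ground truth."""
    ious = []
    for c in range(num_classes):
        inter = np.logical_and(pred == c, gt == c).sum()
        union = np.logical_or(pred == c, gt == c).sum()
        if union > 0:  # skip classes absent from both maps
            ious.append(inter / union)
    return float(np.mean(ious))

# Tiny worked example: two 2x2 label maps over 3 classes.
pred = np.array([[0, 1], [1, 2]])
gt = np.array([[0, 1], [2, 2]])
print(mean_iou(pred, gt, num_classes=3))  # (1.0 + 0.5 + 0.5) / 3 ≈ 0.667
```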

    In conclusion, the study introduces a benchmarking framework for assessing the visual capabilities of MFMs such as GPT-4o, Gemini, and Claude by converting standard vision tasks into prompt-based formats. Findings show that MFMs perform better on semantic tasks than geometric ones, with GPT-4o leading overall, yet all MFMs lag significantly behind task-specific vision models. Despite being generalists trained primarily on image-text data, they show promising progress; newer reasoning models such as o3 are notably stronger on 3D tasks. Limitations include high inference cost and prompt sensitivity. Still, the framework provides a unified approach to evaluating MFMs' visual understanding and lays the groundwork for future advances.


    Check out the Paper, GitHub Page, and Project. All credit for this research goes to the researchers of this project.


    Source: MarkTechPost
