
    GPT-4o Understands Text, But Does It See Clearly? A Benchmarking Study of MFMs on Vision Tasks

    July 24, 2025

    Multimodal foundation models (MFMs) like GPT-4o, Gemini, and Claude have shown rapid progress recently, especially in public demos. While their language skills are well studied, their true ability to understand visual information remains unclear. Most benchmarks used today focus heavily on text-based tasks, such as visual question answering (VQA) or classification, which often reflect language strengths more than visual capabilities. These tests also require text outputs, making it difficult to assess visual skills fairly or to compare MFMs with vision-specific models. Moreover, critical aspects of visual understanding such as 3D perception, segmentation, and grouping are still largely overlooked in current evaluations.

    MFMs have demonstrated strong performance in tasks that combine visual and language understanding, such as captioning and visual question answering. However, their effectiveness in tasks that require detailed visual comprehension remains unclear. Most current benchmarks rely on text-based outputs, making it difficult to compare MFMs with vision-only models fairly. Some studies attempt to adapt vision datasets for MFMs by converting annotations into text, but this conversion still restricts evaluation to language outputs. Prompting strategies have also been explored to help MFMs tackle visual tasks by breaking them into manageable subtasks, though reproducibility remains a challenge in some cases.

    Researchers at EPFL evaluated several popular multimodal foundation models, including GPT-4o, Gemini 2.0 Flash, and Claude 3.5 Sonnet, on core computer vision tasks such as segmentation, object detection, and depth prediction, using datasets like COCO and ImageNet. Since most MFMs are designed to output text and are only accessible via APIs, the team developed a prompt-chaining framework that translates these visual tasks into text-compatible formats. Their findings show that while MFMs are competent generalists, they fall short of specialized vision models, especially on geometric tasks. GPT-4o stood out, performing best in 4 out of 6 tasks. The evaluation toolkit will be open-sourced.

    To evaluate MFMs on vision tasks, the study designed a prompt-chaining strategy that breaks complex tasks into simpler, language-friendly subtasks. For example, instead of predicting bounding boxes directly, the model first identifies which objects are present, then locates each one through recursive image cropping. For segmentation and grouping, images are divided into superpixels, which are easier to label and compare. Depth and surface normals are estimated from pairwise rankings of superpixel regions. This modular design plays to MFMs’ strengths in classification and similarity judgments, while calibration controls ensure fair comparisons. The method is flexible, and performance improves with finer-grained prompting.
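
    As a concrete illustration, here is a minimal Python sketch of the recursive-cropping idea for object localization described above. It is not the authors' released toolkit: the ask_mfm helper is a hypothetical placeholder for an API call to a multimodal model, and the quadrant-based prompt wording is an assumption about how such a chain could be phrased.

        # Minimal sketch of recursive image cropping via prompt chaining.
        # ask_mfm is a hypothetical stand-in for a call to an MFM API;
        # it is NOT part of the paper's evaluation toolkit.
        from PIL import Image

        def ask_mfm(image: Image.Image, prompt: str) -> str:
            """Hypothetical wrapper returning the model's text answer."""
            raise NotImplementedError("plug in your preferred MFM API client here")

        def locate_object(image: Image.Image, label: str, box=None, depth=0, max_depth=4):
            """Recursively narrow down the region the model says contains `label`.

            Returns an approximate bounding box (left, top, right, bottom)
            in the coordinate frame of the original image.
            """
            if box is None:
                box = (0, 0, image.width, image.height)
            if depth == max_depth:
                return box  # the crop is small enough; report it as the box

            left, top, right, bottom = box
            mid_x, mid_y = (left + right) // 2, (top + bottom) // 2
            quadrants = {
                "top-left": (left, top, mid_x, mid_y),
                "top-right": (mid_x, top, right, mid_y),
                "bottom-left": (left, mid_y, mid_x, bottom),
                "bottom-right": (mid_x, mid_y, right, bottom),
            }
            answer = ask_mfm(
                image.crop(box),
                f"Which quadrant of this image contains the {label}? "
                "Answer with exactly one of: top-left, top-right, bottom-left, bottom-right.",
            ).strip().lower()
            next_box = quadrants.get(answer, box)  # keep the current box if the reply is unparseable
            return locate_object(image, label, next_box, depth + 1, max_depth)

    In the same spirit, one chain per object can be run for every label the model lists in an initial "which objects are present?" step, and segmentation or depth questions can be phrased over superpixel regions instead of quadrants.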

    The study evaluates these MFMs, including GPT-4o, Gemini 2.0 Flash, and Claude 3.5 Sonnet, across multiple tasks, such as image classification, object detection, and segmentation, using datasets like ImageNet, COCO, and Hypersim. Results show GPT-4o reaching 77.2% on ImageNet and 60.62 AP50 for object detection, well behind specialist models such as ViT-G (90.94% on ImageNet) and Co-DETR (91.30 AP50). On semantic segmentation, GPT-4o scores 44.89 mIoU, while OneFormer leads with 65.52 mIoU. MFMs handle distribution shifts reasonably well but lag on precise visual reasoning. The study also introduces prompt chaining and oracle baselines to estimate upper-bound performance.
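
    For reference, the mIoU figures above are the standard mean intersection-over-union for semantic segmentation. A minimal sketch of the metric, assuming integer class-label maps for prediction and ground truth (not the paper's evaluation code):

        import numpy as np

        def mean_iou(pred: np.ndarray, gt: np.ndarray, num_classes: int) -> float:
            """Mean IoU over classes that appear in either label map."""
            ious = []
            for c in range(num_classes):
                pred_c, gt_c = pred == c, gt == c
                union = np.logical_or(pred_c, gt_c).sum()
                if union == 0:
                    continue  # class absent from both maps; skip it
                intersection = np.logical_and(pred_c, gt_c).sum()
                ious.append(intersection / union)
            return float(np.mean(ious)) if ious else 0.0

        # Toy example: 2x2 label maps with two classes.
        pred = np.array([[0, 1], [1, 1]])
        gt   = np.array([[0, 0], [1, 1]])
        print(mean_iou(pred, gt, num_classes=2))  # (1/2 + 2/3) / 2 ≈ 0.583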

    In conclusion, the study introduces a benchmarking framework to assess the visual capabilities of MFMs such as GPT-4o, Gemini, and Claude by converting standard vision tasks into prompt-based formats. Findings show that MFMs perform better on semantic tasks than on geometric ones, with GPT-4o leading overall, though all MFMs lag significantly behind task-specific vision models. Despite being generalists trained primarily on image-text data, they show promising progress, and newer reasoning models such as o3 do noticeably better on 3D tasks. Limitations include high inference cost and prompt sensitivity. Still, the framework provides a unified approach to evaluating MFMs’ visual understanding, laying the groundwork for future advances.


    Check out the Paper, GitHub Page and Project. All credit for this research goes to the researchers of this project.

    The post GPT-4o Understands Text, But Does It See Clearly? A Benchmarking Study of MFMs on Vision Tasks appeared first on MarkTechPost.
