
    Are We on the Right Way for Evaluating Large Vision-Language Models? This AI Paper from China Introduces MMStar: An Elite Vision-Dependent Multi-Modal Benchmark

    April 3, 2024

Large vision-language models (LVLMs) showcase powerful visual perception and understanding capabilities. These achievements have inspired the research community to build a variety of multi-modal benchmarks that probe the capabilities emerging from LVLMs and provide a comprehensive, objective platform for quantitatively comparing continually evolving models. However, after careful evaluation, the researchers identified two primary issues:
    1) Visual content is unnecessary for many samples, and
    2) Unintentional data leakage exists in LLM and LVLM training. 

Early single-task benchmarks, such as VQA, MS-COCO, and OK-VQA, fail to holistically assess LVLMs’ general multi-modal perception and reasoning capabilities. To address this, comprehensive multi-modal benchmarks such as SEED, MMBench, and MMMU provide competitive arenas for comparing cutting-edge LVLMs. However, existing evaluations overlook two critical issues. On the one hand, they do not guarantee that evaluation samples cannot be answered correctly without the visual content. On the other hand, current evaluations simply run inference on a given benchmark and calculate scores, overlooking the possibility of data leakage during multi-modal training. Both oversights can lead to unfair comparisons and misjudgments.

    The researchers from the University of Science and Technology of China, The Chinese University of Hong Kong, and Shanghai AI Laboratory present MMStar, an elite vision-indispensable multi-modal benchmark comprising 1,500 samples meticulously selected by humans. MMStar benchmarks six core capabilities and 18 detailed axes, aiming to evaluate LVLMs’ multi-modal capacities with carefully balanced and purified samples. These samples are first roughly selected from current benchmarks with an automated pipeline; human review is then involved to ensure each curated sample exhibits visual dependency, minimal data leakage, and requires advanced multi-modal capabilities. Moreover, two metrics are developed to measure data leakage and actual performance gain in multi-modal training.

MMStar is presented in three parts:

Data Curation Process: Criteria for data curation: the evaluation samples for constructing the MMStar benchmark must meet three fundamental criteria: 1) Visual dependency: the collected samples can be answered correctly only by understanding the visual content; 2) Minimal data leakage: the collected samples should minimize the risk of unintentional inclusion in LLMs’ training corpora, or be effectively transformed from uni-modal to multi-modal formats to prevent LLMs from “recalling” the correct answers; 3) Requiring advanced multi-modal capabilities for resolution.

Data filter: For sample collection, they first choose two benchmarks focused on natural images and four centered on scientific and technical knowledge. They then develop an automated pipeline to preliminarily filter out samples that do not meet the first two criteria, employing two closed-source LLMs and six open-source LLMs as inspectors.
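The article does not spell out how these LLM inspectors are applied, but a plausible reading, consistent with the criteria above, is that each inspector attempts a question without its image, and samples that most text-only models can already answer are discarded. The following is a minimal Python sketch under that assumption; the `Sample` type, the `ask_text_only` helper, the `llm.generate` interface, and the 50% threshold are all illustrative, not the paper’s exact pipeline.

```python
from dataclasses import dataclass

@dataclass
class Sample:
    question: str
    choices: list[str]      # e.g. ["A. cat", "B. dog", ...]
    answer: str             # ground-truth option label, e.g. "B"
    image_path: str

def ask_text_only(llm, sample: Sample) -> str:
    """Ask an LLM the question and its options but *without* the image."""
    prompt = sample.question + "\n" + "\n".join(sample.choices)
    return llm.generate(prompt).strip()  # assumed generic text-generation API

def passes_coarse_filter(sample: Sample, inspectors, max_correct_ratio: float = 0.5) -> bool:
    """Keep a sample only if most text-only inspectors fail on it, which
    suggests visual dependency and a low risk of training-data leakage."""
    correct = sum(ask_text_only(llm, sample) == sample.answer for llm in inspectors)
    return correct / len(inspectors) <= max_correct_ratio

# Usage sketch:
# candidates = [s for s in pooled_benchmark if passes_coarse_filter(s, inspectors)]
```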

Manual review: After the coarse filtering with LLM inspectors, three experts conduct a manual review to ensure that: 1) each sample’s answer is grounded in understanding the visual content; 2) the selected samples cover a comprehensive range of capability-assessment dimensions; 3) most samples require LVLMs to possess advanced multi-modal abilities for resolution.

Core Capabilities: They select and consolidate the dimensions used for assessing LVLMs’ multi-modal capabilities in existing benchmarks and identify six core capability dimensions and 18 detailed axes.

Multi-modal Gain/Leakage: They propose two metrics to assess the degree of data leakage and the actual performance gain from the multi-modal training process.
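The summary does not give the formulas for these two metrics. One natural formulation, offered here only as a hedged sketch, treats multi-modal gain as the score an LVLM adds when it actually sees the images, and multi-modal leakage as how far the LVLM exceeds its underlying base LLM even when the images are withheld; the variable names and numbers below are illustrative, not the paper’s definitions.

```python
def multimodal_gain(score_with_images: float, score_without_images: float) -> float:
    """Performance attributable to genuinely using the visual input
    (both scores come from the same LVLM, with and without images)."""
    return score_with_images - score_without_images

def multimodal_leakage(score_without_images: float, score_base_llm: float) -> float:
    """How much the LVLM answers correctly without images beyond its base LLM,
    hinting that evaluation samples leaked into multi-modal training data."""
    return max(0.0, score_without_images - score_base_llm)

# Illustrative numbers only: an LVLM scoring 57.1 with images, 40.0 without,
# whose base LLM scores 35.0 text-only, would show gain 17.1 and leakage 5.0.
```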

They evaluated two closed-source and 14 open-source LVLMs on MMStar. GPT-4V with a high-resolution setting achieves the best average score among all LVLMs, 57.1%; increasing the resolution and the number of image tokens boosts its average score from 46.1% to 57.1%. Among the open-source LVLMs, InternLM-XComposer2 achieves an impressive 55.4%, and LLaVA-Next even surpasses GPT-4V and GeminiPro-Vision in the mathematics (MA) core capability.

In conclusion, the researchers dug deeper into evaluation practice for LVLMs and found two key issues: 1) visual content is unnecessary for many samples, and 2) unintentional data leakage exists in LLM and LVLM training. They developed an elite vision-dependent multi-modal benchmark named MMStar and proposed two metrics to measure data leakage and the actual performance gain from LVLMs’ multi-modal training. Every MMStar sample undergoes manual review, and the benchmark covers six core capabilities and 18 detailed axes for an in-depth evaluation of LVLMs’ multi-modal capabilities. Evaluating 16 diverse LVLMs on MMStar, even the best model scores below 60% on average.

Check out the Paper. All credit for this research goes to the researchers of this project.

    The post Are We on the Right Way for Evaluating Large Vision-Language Models? This AI Paper from China Introduces MMStar: An Elite Vision-Dependent Multi-Modal Benchmark appeared first on MarkTechPost.

