Salesforce AI Research Propose Programmatic VLM Evaluation (PROVE): A New Benchmarking Paradigm for Evaluating VLM Responses to Open-Ended Queries

Vision-Language Models (VLMs) are increasingly used for generating responses to queries about visual content. Despite their progress, they often suffer from a major issue: generating plausible but incorrect responses, also known as hallucinations. These hallucinations can lead to a lack of trust in these systems, especially in real-world, high-stakes applications. Evaluating the helpfulness and truthfulness of VLM-generated responses is challenging because it requires not only understanding visual content but also verifying each claim made in the response. Traditional benchmarks have not been adequate for addressing this challenge, either because they limit evaluations to simplistic, binary questions or because they rely on incomplete context to judge open-ended responses.

Researchers from Salesforce AI Research have proposed Programmatic VLM Evaluation (PROVE), a new benchmarking paradigm that evaluates VLM responses to open-ended visual queries. In PROVE, researchers use a high-fidelity scene graph representation constructed from hyper-detailed image captions and employ a large language model (LLM) to generate diverse question-answer (QA) pairs along with executable programs to verify each QA pair. This approach allows the creation of a benchmark dataset of 10.5k visually grounded and challenging QA pairs. The evaluation strategy involves measuring both the helpfulness and truthfulness of VLM responses using a unified framework based on scene graph comparisons. This programmatic evaluation provides a more reliable and interpretable assessment of VLM performance compared to previous benchmarks.
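To make the idea of a verification program concrete, here is a minimal sketch of how a QA pair might be checked against a scene graph. It is an illustration only, not the PROVE implementation: the `SceneGraph` class, its methods, the `verify_qa` helper, and the example image content are all hypothetical names and data introduced here.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Callable


@dataclass
class SceneGraph:
    """Hypothetical scene graph: entities with attributes, plus relation triples."""
    attributes: dict[str, set[str]] = field(default_factory=dict)
    relations: set[tuple[str, str, str]] = field(default_factory=set)

    def attrs_of(self, entity: str) -> set[str]:
        # Unknown entities simply have no attributes.
        return self.attributes.get(entity, set())

    def related(self, subject: str, relation: str) -> list[str]:
        # All objects linked to `subject` by `relation`, in a stable order.
        return sorted(o for s, r, o in self.relations if s == subject and r == relation)


def verify_qa(graph: SceneGraph, program: Callable[[SceneGraph], str], answer: str) -> bool:
    """Keep a QA pair only if its program runs and reproduces the stated answer."""
    try:
        return program(graph) == answer
    except Exception:
        # A program that crashes cannot verify its QA pair, so the pair is dropped.
        return False


# Made-up scene graph for an image of a dog on a sofa.
graph = SceneGraph(
    attributes={"dog": {"brown", "small"}, "sofa": {"red"}},
    relations={("dog", "sitting on", "sofa")},
)


# Q: "What is the dog sitting on?"  A: "sofa"
def program(g: SceneGraph) -> str:
    return g.related("dog", "sitting on")[0]


keep_pair = verify_qa(graph, program, "sofa")
```

In this sketch, a pair whose program returns a different answer, or fails to execute, is filtered out, which mirrors the article's point that only programmatically verifiable QA pairs are retained.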

The PROVE benchmark uses detailed scene graph representations and executable programs to verify the correctness of VLM responses. Scene graphs, constructed from detailed image captions, contain entities, attributes, and relationships that represent the visual scene. By prompting an LLM, researchers generate open-ended QA pairs and corresponding verification programs that ensure the questions are challenging yet verifiable. Only QA pairs that can be programmatically verified are retained in the benchmark, resulting in a high-quality dataset. The evaluation involves extracting scene graph representations from both the model responses and ground truth answers, and then calculating scores based on the recall and precision of these representations, measuring how helpful and truthful the responses are.
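The recall and precision scoring described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the benchmark's actual scoring code: it treats each scene graph as a set of (subject, relation, object) tuples and uses exact matching, whereas the real matching procedure is not described in this article. The function names and example tuples are hypothetical.

```python
from __future__ import annotations

Triple = tuple[str, str, str]


def helpfulness(response: set[Triple], answer: set[Triple]) -> float:
    """Recall: the fraction of ground-truth tuples that the response covers."""
    if not answer:
        return 0.0
    return len(response & answer) / len(answer)


def truthfulness(response: set[Triple], reference: set[Triple]) -> float:
    """Precision: the fraction of response tuples supported by the reference."""
    if not response:
        return 0.0
    return len(response & reference) / len(response)


# Made-up tuples extracted from a ground-truth answer.
answer = {
    ("dog", "sitting on", "sofa"),
    ("sofa", "is", "red"),
}

# Made-up tuples extracted from a model response: it covers everything in the
# answer but also adds one unsupported claim.
response = {
    ("dog", "sitting on", "sofa"),
    ("sofa", "is", "red"),
    ("dog", "is", "black"),
}

h = helpfulness(response, answer)   # every ground-truth tuple is covered
t = truthfulness(response, answer)  # two of three response tuples are supported
```

Here the response is fully helpful (recall of 1.0) but only two thirds truthful, because one of its three claims has no support. This is the kind of gap between the two scores that the evaluation is designed to surface.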

The results show that current VLMs struggle to balance helpfulness and truthfulness. Models such as GPT-4o, Phi-3.5-Vision, and Pixtral achieved higher helpfulness scores but not necessarily higher truthfulness. The study also found that increasing model size tends to improve helpfulness but does not always improve truthfulness, and that recent advances in VLM training have raised helpfulness without consistently translating into more truthful outputs. Notably, the LLaVA-1.5 model series achieved the best truthfulness scores, suggesting that smaller, more focused models might outperform larger ones in maintaining accuracy.

In conclusion, PROVE presents a significant advancement in evaluating the helpfulness and truthfulness of VLM-generated responses. By leveraging detailed scene graph representations and programmatic verification, this benchmark provides a more reliable and interpretable evaluation framework. The findings underscore the need for VLMs that strike a balance between generating informative and accurate responses, especially as their use in real-world applications continues to grow. Future research is expected to focus on improving both the helpfulness and truthfulness of these models through advanced training techniques and new evaluation strategies.

Check out the Paper and Dataset Card. All credit for this research goes to the researchers of this project.

The post Salesforce AI Research Propose Programmatic VLM Evaluation (PROVE): A New Benchmarking Paradigm for Evaluating VLM Responses to Open-Ended Queries appeared first on MarkTechPost.
