LLaVA-Critic: An Open-Source Large Multimodal Model Designed to Assess Model Performance Across Diverse Multimodal Tasks

The ability to learn to evaluate is taking on a pivotal role in the development of modern large multimodal models (LMMs). As pre-training on existing web data reaches its limits, researchers are shifting towards post-training with AI-enhanced synthetic data, which makes reliable automated evaluation increasingly important. Reliable AI evaluation can reduce human labor in complex task assessments, generate effective reward signals in reinforcement learning, and guide inference-time search. Despite progress in single-image, multi-image, and video scenarios, open LMMs capable of evaluating the performance of other multimodal models remain a gap in the field.

Existing attempts to address the challenge of AI evaluation have primarily focused on using proprietary LMMs like GPT-4V as generalist evaluators for vision-language tasks. These models have been used in evaluation benchmarks for complex scenarios such as visual chat and detailed captioning. Open-source alternatives like Prometheus-Vision have also emerged as evaluators for specific user-designed scoring criteria. In preference learning for LMMs, techniques like Reinforcement Learning from Human Feedback (RLHF) and Direct Preference Optimization (DPO) have been applied to align models with human intentions. Recent research has extended these ideas to the multimodal space, exploring various strategies to improve visual chat abilities and reduce hallucinations in vision-language models.

Researchers from ByteDance and the University of Maryland, College Park have proposed LLaVA-Critic, the first LMM specifically designed for evaluation tasks. The approach focuses on curating instruction-following data tailored for evaluation purposes, and it addresses two primary scenarios: serving as an LMM-as-a-Judge and facilitating Preference Learning. In the first scenario, it aims to provide reliable evaluation scores comparable to proprietary models like GPT-4V, offering a free alternative for various evaluation benchmarks. In the second, it presents a scalable way to generate effective reward signals, reducing dependence on costly human feedback collection. LLaVA-Critic shows a high correlation with commercial GPT models in evaluation tasks and superior performance in preference learning.
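To make the preference-learning scenario more concrete, here is a minimal sketch of how pairwise verdicts from a critic could be turned into a (chosen, rejected) pair for methods such as DPO. The `judge` callable, the "A"/"B"/"tie" verdict labels, and the win-count aggregation are all assumptions made for illustration; they are not taken from the LLaVA-Critic paper or code.

```python
from itertools import combinations
from typing import Callable, Optional, Sequence, Tuple

# Hypothetical judge signature: (prompt, response_a, response_b) -> "A", "B", or "tie".
Judge = Callable[[str, str, str], str]


def build_preference_pair(
    prompt: str, candidates: Sequence[str], judge: Judge
) -> Optional[Tuple[str, str]]:
    """Return (chosen, rejected) based on pairwise win counts, or None if undecided."""
    if len(candidates) < 2:
        raise ValueError("need at least two candidate responses")

    # Count how many pairwise comparisons each candidate wins (assumed aggregation).
    wins = [0] * len(candidates)
    for i, j in combinations(range(len(candidates)), 2):
        verdict = judge(prompt, candidates[i], candidates[j])
        if verdict == "A":
            wins[i] += 1
        elif verdict == "B":
            wins[j] += 1
        # A "tie" verdict gives no win to either side.

    best = max(range(len(candidates)), key=wins.__getitem__)
    worst = min(range(len(candidates)), key=wins.__getitem__)
    if wins[best] == wins[worst]:
        return None  # The judge expressed no usable preference.
    return candidates[best], candidates[worst]


def toy_length_judge(prompt: str, a: str, b: str) -> str:
    """Stand-in judge that prefers the longer response; a real critic would be an LMM."""
    if len(a) == len(b):
        return "tie"
    return "A" if len(a) > len(b) else "B"


pair = build_preference_pair("Describe the image.", ["a", "abc", "ab"], toy_length_judge)
```

In practice the toy judge would be replaced by a call to the critic model, and the resulting pairs would feed a preference-optimization trainer.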

LLaVA-Critic is developed by fine-tuning a pre-trained LMM that can already follow diverse instructions, so the resulting model can handle a wide range of vision tasks. Training uses an evaluation prompt that combines the multimodal instruction input, one or more model responses, and an optional reference response. LLaVA-Critic is trained to predict quantitative pointwise scores or pairwise rankings based on specified criteria and to provide detailed justifications for its judgments. Training uses a standard cross-entropy loss over both the judgments and the justifications. The researchers start from the LLaVA-OneVision (OV) 7B/72B pre-trained checkpoints and fine-tune them on the LLaVA-Critic-113k dataset for one epoch.
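The structure of such an evaluation prompt can be sketched as follows. This is an illustrative assumption only: the function name `build_eval_prompt`, the bracketed section labels, and the closing instructions are hypothetical and do not reproduce the templates used in LLaVA-Critic-113k. The image itself would be passed to the model separately and is not shown.

```python
from typing import Optional, Sequence


def build_eval_prompt(
    instruction: str,
    responses: Sequence[str],
    criteria: str,
    reference: Optional[str] = None,
) -> str:
    """Assemble a text prompt for pointwise (1 response) or pairwise (2 responses) judging."""
    if len(responses) not in (1, 2):
        raise ValueError("expected one response (pointwise) or two (pairwise)")

    parts = [f"[Instruction]\n{instruction}"]
    for index, response in enumerate(responses, start=1):
        parts.append(f"[Response {index}]\n{response}")
    if reference is not None:
        # The reference answer is optional, as described in the text above.
        parts.append(f"[Reference]\n{reference}")
    parts.append(f"[Criteria]\n{criteria}")

    if len(responses) == 1:
        parts.append("Give a numeric score for the response and justify it.")
    else:
        parts.append("State which response is better, or declare a tie, and justify it.")
    return "\n\n".join(parts)


pointwise_prompt = build_eval_prompt(
    "What is shown in the image?",
    ["A dog running on a beach."],
    "Accuracy and level of detail.",
)
pairwise_prompt = build_eval_prompt(
    "What is shown in the image?",
    ["A dog running on a beach.", "An animal outdoors."],
    "Accuracy and level of detail.",
    reference="A golden retriever running along the shoreline.",
)
```

Keeping pointwise and pairwise prompts in one builder mirrors the idea that a single critic handles both settings; the target text during fine-tuning would be the judgment followed by its justification.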

The results demonstrate significant improvements in both the pointwise scoring and pairwise ranking capabilities of LLaVA-Critic compared to baseline models. LLaVA-Critic-72B achieves the highest average Pearson-r (0.754) and Kendall's Tau (0.933) in pointwise scoring, outperforming the baseline LLaVA-OV-72B. In pairwise ranking, LLaVA-Critic-72B outperforms GPT-4o and GPT-4V in comparisons without ties, achieving 73.6% accuracy. In the MLLM-as-a-Judge scenario, LLaVA-Critic-7B outperforms most of the baselines, which include commercial models and other open-source LMMs. These results highlight the effectiveness of LLaVA-Critic as an open-source alternative for multimodal model evaluation.
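For readers unfamiliar with these metrics, the sketch below shows one way to compute Pearson-r, Kendall's Tau, and pairwise accuracy with ties excluded, using only the Python standard library. The function names, the choice of the tau-a variant, and the toy data are assumptions for illustration; the exact variants and evaluation protocol used by the authors may differ.

```python
import math
from typing import Sequence


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Linear correlation between two equally long score lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def kendall_tau_a(x: Sequence[float], y: Sequence[float]) -> float:
    """Rank agreement: (concordant - discordant) pairs over all pairs (tau-a)."""
    n = len(x)
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            sign = (x[i] - x[j]) * (y[i] - y[j])
            if sign > 0:
                concordant += 1
            elif sign < 0:
                discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)


def pairwise_accuracy_without_ties(pred: Sequence[str], gold: Sequence[str]) -> float:
    """Accuracy over examples whose gold label is not a tie."""
    kept = [(p, g) for p, g in zip(pred, gold) if g != "tie"]
    return sum(p == g for p, g in kept) / len(kept)


# Toy data, invented purely for this example.
critic_scores = [1.0, 2.0, 3.0]
reference_scores = [2.0, 4.0, 6.0]
r = pearson_r(critic_scores, reference_scores)
tau = kendall_tau_a(critic_scores, reference_scores)
acc = pairwise_accuracy_without_ties(["A", "B", "A"], ["A", "tie", "B"])
```

With real data, `critic_scores` would hold the open-source judge's scores and `reference_scores` the scores from the reference evaluator it is being compared against.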

In conclusion, the researchers have proposed LLaVA-Critic, the first LMM specifically designed for evaluation tasks, and trained it on a high-quality, diverse instruction-following dataset. The model excels in two critical areas. First, as a generalized evaluator, LLaVA-Critic aligns closely with human and GPT-4o preferences across various evaluation tasks, offering a viable open-source alternative to commercial models. Second, in preference learning scenarios, LLaVA-Critic functions as a reliable reward model, outperforming human feedback-based approaches in enhancing the visual chat capabilities of LMMs. This research is a key step toward building self-critiquing capabilities in open-source LMMs, enabling future advancements in scalable, superhuman AI alignment feedback.

The post LLaVA-Critic: An Open-Source Large Multimodal Model Designed to Assess Model Performance Across Diverse Multimodal Tasks appeared first on MarkTechPost.
