RAG-Check: A Novel AI Framework for Hallucination Detection in Multi-Modal Retrieval-Augmented Generation Systems

Large Language Models (LLMs) have revolutionized generative AI, showing remarkable capabilities in producing human-like responses. However, these models face a critical challenge known as hallucination: the tendency to generate incorrect or irrelevant information. This issue poses significant risks in high-stakes applications such as medical evaluations, insurance claim processing, and autonomous decision-making systems, where accuracy is paramount. The hallucination problem extends beyond text-based models to vision-language models (VLMs), which process images together with text queries. Although robust VLMs such as LLaVA, InstructBLIP, and VILA have been developed, these systems still struggle to generate accurate responses from image inputs and user queries.

Existing research has introduced several methods to address hallucination in language models. For text-based systems, FactScore improved accuracy by breaking long statements into atomic units for better verification. Lookback Lens developed an attention score analysis approach to detect context hallucination, while MARS implemented a weighted system focusing on crucial statement components. For RAG systems specifically, RAGAS and LlamaIndex emerged as evaluation tools: RAGAS focuses on response accuracy and relevance using human evaluators, while LlamaIndex employs GPT-4 for faithfulness assessment. However, no existing work provides hallucination scores specifically for multi-modal RAG systems, where the context includes multiple pieces of multi-modal data.

Researchers from the University of Maryland, College Park, MD, and NEC Laboratories America, Princeton, NJ have proposed RAG-check, a comprehensive method to evaluate multi-modal RAG systems. It consists of three key components designed to assess both relevance and accuracy. The first component involves a neural network that evaluates the relevancy of each retrieved piece of data to the user query. The second component implements an algorithm that segments and categorizes the RAG output into scorable (objective) and non-scorable (subjective) spans. The third component utilizes another neural network to evaluate the correctness of objective spans against the raw context, which can include both text and images converted to text-based format through VLMs.
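The three components described above can be wired together roughly as follows. This is only a minimal sketch of the data flow, not the authors' implementation: the names `rag_check`, `CheckResult`, `relevancy_fn`, `is_objective_fn`, and `correctness_fn` are hypothetical, the two neural networks are represented by plain callables supplied by the caller, and a naive sentence splitter stands in for the paper's span segmentation.

```python
import re
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class CheckResult:
    relevancy: List[float]       # one score per retrieved context item
    scorable_spans: List[str]    # spans judged objective (checkable)
    correctness: List[float]     # one score per scorable span


def split_into_spans(response: str) -> List[str]:
    """Naive sentence splitter (an assumption; the paper's segmentation may differ)."""
    parts = re.split(r"(?<=[.!?])\s+", response.strip())
    return [p for p in parts if p]


def rag_check(
    query: str,
    contexts: Sequence[str],
    response: str,
    relevancy_fn: Callable[[str, str], float],
    is_objective_fn: Callable[[str], bool],
    correctness_fn: Callable[[str, Sequence[str]], float],
) -> CheckResult:
    # Component 1: score each retrieved item against the user query.
    relevancy = [relevancy_fn(query, c) for c in contexts]
    # Component 2: segment the response and keep only the objective spans.
    scorable = [s for s in split_into_spans(response) if is_objective_fn(s)]
    # Component 3: score each objective span against the raw context.
    correctness = [correctness_fn(s, contexts) for s in scorable]
    return CheckResult(relevancy, scorable, correctness)
```

Passing the scorers in as callables keeps the sketch independent of any particular model; in practice each would wrap a trained network or an API call, and image contexts would first be converted to text by a VLM.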

The RAG-check architecture uses two primary evaluation metrics, the Relevancy Score (RS) and the Correctness Score (CS), to evaluate different aspects of RAG system performance. For evaluating selection mechanisms, the system analyzes the relevancy scores of the top 5 retrieved images across a test set of 1,000 questions, providing insights into the effectiveness of different retrieval methods. For context generation, the architecture allows flexible integration of various model combinations: either separate VLMs (such as LLaVA or GPT-4) paired with LLMs (such as LLaMA or GPT-3.5), or unified multi-modal LLMs (MLLMs) such as GPT-4. This flexibility enables a comprehensive evaluation of different model architectures and their impact on response generation quality.
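One plausible way to aggregate relevancy scores over the top 5 retrieved images across a test set is a mean of per-question means. The sketch below assumes that aggregation (the article does not spell out the exact formula) and assumes scores lie in [0, 1] and are listed in retrieval-rank order; the function name `mean_top_k_relevancy` is hypothetical.

```python
from statistics import mean
from typing import List, Sequence


def mean_top_k_relevancy(scores_per_question: Sequence[Sequence[float]], k: int = 5) -> float:
    """Average the relevancy scores of the first k retrieved items per question,
    then average those per-question values over the whole test set."""
    if k <= 0:
        raise ValueError("k must be positive")
    per_question: List[float] = []
    for scores in scores_per_question:
        top_k = list(scores[:k])  # items are assumed to be in retrieval-rank order
        if not top_k:
            raise ValueError("every question needs at least one retrieved item")
        per_question.append(mean(top_k))
    if not per_question:
        raise ValueError("the test set is empty")
    return mean(per_question)
```

Comparing this number for two retrieval methods on the same question set gives a single figure per method, which is the kind of comparison the percentages reported for each selection mechanism support.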

The evaluation results demonstrate significant performance variations across different RAG system configurations. When CLIP models were used as vision encoders with cosine similarity for image selection, the average relevancy scores ranged from 30% to 41%. Using the RS model to evaluate query-image pairs instead raised relevancy scores dramatically, to between 71% and 89.5%, though at the cost of a 35-fold increase in computational requirements on an A100 GPU. GPT-4o emerged as the superior configuration for context generation and error rates, outperforming other setups by 20%. The remaining RAG configurations showed comparable performance, with accuracy rates between 60% and 68%.
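The baseline selection step, ranking images by cosine similarity between a query embedding and image embeddings, can be sketched as below. The embeddings are assumed to come from a CLIP-style encoder that is not included here; the helper names `cosine_similarity` and `top_k_by_cosine` are hypothetical, and the code is written in plain Python for readability rather than speed.

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k_by_cosine(
    query_vec: Sequence[float],
    image_vecs: Sequence[Sequence[float]],
    k: int = 5,
) -> List[int]:
    """Return the indices of the k images most similar to the query, best first."""
    sims = [cosine_similarity(query_vec, v) for v in image_vecs]
    ranked = sorted(range(len(sims)), key=lambda i: sims[i], reverse=True)
    return ranked[:k]
```

The RS-based alternative described above would replace `cosine_similarity` with a learned scorer applied to each query-image pair, which accounts for the higher relevancy scores and the much higher compute cost.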

In conclusion, the researchers introduced RAG-check, a novel evaluation framework for multi-modal RAG systems that addresses the critical challenge of hallucination detection across multiple images and text inputs. The framework's three-component architecture, comprising relevancy scoring, span categorization, and correctness assessment, enables a finer-grained evaluation of RAG performance. The results reveal that while the RS model raises relevancy scores from the 30% to 41% range achieved with CLIP to between 71% and 89.5%, it does so at roughly 35 times the computational cost. Among the configurations tested, GPT-4o emerged as the most effective model for context generation, highlighting the potential of unified multi-modal language models to improve RAG system accuracy and reliability.

The post RAG-Check: A Novel AI Framework for Hallucination Detection in Multi-Modal Retrieval-Augmented Generation Systems appeared first on MarkTechPost.
