
    Advancing Reliable Question Answering with the CRAG Benchmark

    June 11, 2024

Large Language Models (LLMs) have revolutionized Natural Language Processing (NLP), particularly Question Answering (QA). However, hallucination remains a significant obstacle: LLMs may generate factually inaccurate or ungrounded responses. Studies reveal that even state-of-the-art models like GPT-4 struggle to answer questions accurately when they involve changing facts or less popular entities. Overcoming hallucination is crucial for developing reliable QA systems. Retrieval-Augmented Generation (RAG) has emerged as a promising approach to address LLMs’ knowledge deficiencies, but it faces challenges of its own: selecting the most relevant information, keeping latency low, and synthesizing answers to complex queries.

Researchers from Meta Reality Labs, FAIR, Meta, HKUST, and HKUST (GZ) proposed CRAG (Comprehensive RAG Benchmark), which is designed around five critical features: realism, richness, insightfulness, reliability, and longevity. It contains 4,409 diverse QA pairs across five domains, spanning simple fact-based questions and seven types of complex questions. The questions cover entities of varying popularity and facts of varying temporal dynamism, and were manually verified and paraphrased for realism and reliability. CRAG also provides mock APIs simulating retrieval from web pages (via the Brave Search API) and from mock knowledge graphs with 2.6 million entities, reflecting realistic retrieval noise. The benchmark offers three tasks to evaluate the web retrieval, structured querying, and summarization capabilities of RAG solutions.
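
To make the benchmark’s structure concrete, the sketch below models what a single CRAG-style QA record might look like. The field names and values are illustrative assumptions, not the benchmark’s actual schema; consult the paper and released data for the real format.

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical layout of one CRAG-style QA pair. Field names are
# illustrative assumptions; the released benchmark's schema may differ.
@dataclass
class CragExample:
    question: str
    answer: str                 # ground-truth answer, manually verified
    domain: str                 # one of the five domains
    question_type: str          # "simple" or one of seven complex types
    entity_popularity: str      # e.g. "head", "torso", or "tail" entity
    temporal_dynamism: str      # e.g. "real-time", "fast-changing", "static"
    web_results: List[dict] = field(default_factory=list)  # mock search pages

example = CragExample(
    question="Which team won the most recent Super Bowl?",
    answer="the Kansas City Chiefs",
    domain="sports",
    question_type="simple",
    entity_popularity="head",
    temporal_dynamism="fast-changing",
)
```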

CRAG comprises three tasks designed to evaluate different capabilities of RAG QA systems. All three share the same set of (question, answer) pairs but differ in the external data a system may retrieve to augment answer generation. Task 1 (Retrieval Summarization) provides up to five potentially relevant web pages per question to test answer generation. Task 2 (KG and Web Retrieval Augmentation) additionally provides mock APIs for accessing structured data from knowledge graphs (KGs), examining a system’s ability to query structured sources and synthesize information. Task 3 mirrors Task 2 but supplies 50 web pages instead of 5 as retrieval candidates, testing a system’s ability to rank and exploit a larger, noisier, but more comprehensive pool of information.
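
As a rough illustration of Task 1, the following sketch assembles a grounded prompt from the retrieved pages before calling a language model. It is a minimal sketch under the assumption that whole pages fit in the prompt; a real system would chunk, embed, and rank passages, and `llm` below stands in for any LLM client.

```python
from typing import List

def build_rag_prompt(question: str, pages: List[str], max_pages: int = 5) -> str:
    """Assemble a grounded prompt from retrieved web pages (Task 1 style)."""
    context = "\n\n".join(
        f"[Source {i + 1}]\n{page}" for i, page in enumerate(pages[:max_pages])
    )
    return (
        "Answer the question using only the sources below. If the sources "
        "do not contain the answer, reply 'I don't know'.\n\n"
        f"{context}\n\nQuestion: {question}\nAnswer:"
    )

# Usage (llm is a placeholder for any LLM client):
# answer = llm.generate(build_rag_prompt(question, retrieved_pages))
```

Instructing the model to abstain when the sources are silent matters here, because hallucination-aware evaluations like CRAG’s treat a wrong answer as worse than no answer.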

    The results and comparisons demonstrate the effectiveness of the proposed CRAG benchmark. While advanced language models like GPT-4 achieve only around 34% accuracy on CRAG, incorporating straightforward RAG improves accuracy to 44%. However, even state-of-the-art industry RAG solutions answer only 63% of questions without hallucination, struggling with facts of higher dynamism, lower popularity, or greater complexity. These evaluations highlight that CRAG has an appropriate level of difficulty and enables insights from its diverse data. The evaluations also underscore the research gaps towards developing fully trustworthy question-answering systems, making CRAG a valuable benchmark for driving further progress in this field.
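
The accuracy and hallucination figures above imply a scoring scheme in which a wrong answer is worse than an abstention. The sketch below implements such a three-way score; it is a simplified assumption in the spirit of CRAG’s evaluation, and the paper’s exact rubric may grade answers more finely.

```python
def truthfulness_score(verdicts: list) -> float:
    """Average a list of per-question verdicts into one score.

    Simplified hallucination-aware scoring (an assumption, not CRAG's
    exact rubric): correct answers earn +1, abstentions earn 0, and
    hallucinated answers earn -1, so guessing is worse than abstaining.
    """
    points = {"correct": 1.0, "missing": 0.0, "hallucinated": -1.0}
    return sum(points[v] for v in verdicts) / len(verdicts)

print(truthfulness_score(["correct", "correct", "missing", "hallucinated"]))  # 0.25
```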

In this study, the researchers introduce CRAG, a comprehensive benchmark intended to propel RAG research for question-answering systems. Through rigorous empirical evaluations, CRAG exposes shortcomings in existing RAG solutions and offers valuable insights for future improvements. The benchmark’s creators plan to continuously expand CRAG with multi-lingual questions, multi-modal inputs, multi-turn conversations, and more, so that it keeps pace with emerging challenges and new research needs in this rapidly progressing field. The benchmark provides a robust foundation for advancing reliable, grounded language generation.

Check out the Paper. All credit for this research goes to the researchers of this project. This article appeared first on MarkTechPost.
