
    Salesforce AI Researchers Introduce UAEval4RAG: A New Benchmark to Evaluate RAG Systems’ Ability to Reject Unanswerable Queries

    May 20, 2025

    Retrieval-Augmented Generation (RAG) lets LLMs answer questions grounded in an external knowledge base without extensive model retraining, but current evaluation frameworks focus on accuracy and relevance for answerable questions, neglecting the crucial ability to reject unsuitable or unanswerable requests. This creates high risk in real-world applications, where inappropriate responses can lead to misinformation or harm. Existing unanswerability benchmarks are inadequate for RAG systems because they contain static, general requests that cannot be customized to a specific knowledge base. Moreover, when RAG systems do reject queries, the refusal often stems from retrieval failures rather than genuine recognition that a request should not be fulfilled, highlighting a critical gap in evaluation methodology.
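    To make that gap concrete, the sketch below (illustrative Python, not the paper's code) shows a typical RAG answer step with an explicit rejection instruction. The `retriever` and `llm` interfaces and the prompt wording are assumptions; deciding whether such a refusal is issued for the right reasons, rather than because retrieval simply failed, is exactly what UAEval4RAG targets.

```python
# A minimal sketch (not the paper's implementation) of a RAG answer step that
# either answers from retrieved context or explicitly declines. The `retriever`
# and `llm` interfaces and the prompt wording are illustrative assumptions.

REJECT_MESSAGE = "I cannot answer this request based on the provided knowledge base."

def rag_answer(query: str, retriever, llm, top_k: int = 5) -> str:
    """Retrieve supporting chunks and ask the LLM to answer or refuse."""
    chunks = retriever.search(query, k=top_k)        # assumed: returns list[str] of passages
    context = "\n\n".join(chunks)
    prompt = (
        "Answer the question using only the context below. If the question is "
        f"unanswerable, unsafe, or unsupported by the context, reply exactly:\n{REJECT_MESSAGE}\n\n"
        f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
    )
    return llm.generate(prompt)                      # assumed: returns str
```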

    Prior research on unanswerability benchmarks has provided insights into model noncompliance, exploring ambiguous questions and underspecified inputs. RAG evaluation has advanced through diverse LLM-based techniques: methods like RAGAS and ARES assess the relevance of retrieved documents, while RGB and MultiHop-RAG measure output accuracy against ground truths. Some benchmarks have begun evaluating rejection capabilities in RAG systems, but they rely on LLM-generated unanswerable contexts as the external knowledge and narrowly evaluate rejection of a single type of unanswerable request. As a result, current methods fail to adequately assess a RAG system’s ability to reject diverse unanswerable requests over user-provided knowledge bases.

    Researchers from Salesforce Research have proposed UAEval4RAG, a framework that synthesizes datasets of unanswerable requests for any external knowledge base and automatically evaluates RAG systems. UAEval4RAG assesses not only how well RAG systems respond to answerable requests but also their ability to reject six distinct categories of unanswerable queries: Underspecified, False-presuppositions, Nonsensical, Modality-limited, Safety Concerns, and Out-of-Database. The researchers also built an automated pipeline that generates diverse, challenging requests tailored to any given knowledge base. The generated datasets are then used to evaluate RAG systems with two LLM-based metrics: Unanswered Ratio and Acceptable Ratio.
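    The two metrics can be read as simple ratios over the synthesized set of unanswerable requests. The sketch below is a hedged illustration: `rag_system.answer` and the judge’s `declines`/`acceptable` helpers are hypothetical placeholders for the paper’s LLM-judge prompts, which are not reproduced here.

```python
# Illustrative computation of the two LLM-based metrics over unanswerable
# queries. `rag_system.answer` and the `judge` helpers are hypothetical.
from dataclasses import dataclass

@dataclass
class UnanswerableEval:
    unanswered_ratio: float   # share of unanswerable queries the system declined
    acceptable_ratio: float   # share of responses the LLM judge deems appropriate

def evaluate_unanswerable(queries: list[str], rag_system, judge) -> UnanswerableEval:
    responses = [rag_system.answer(q) for q in queries]
    declined = [judge.declines(q, r) for q, r in zip(queries, responses)]      # bool per query
    acceptable = [judge.acceptable(q, r) for q, r in zip(queries, responses)]  # bool per query
    n = len(queries)
    return UnanswerableEval(sum(declined) / n, sum(acceptable) / n)
```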

    UAEval4RAG also evaluates how different RAG components affect performance on both answerable and unanswerable queries. Testing 27 combinations of embedding models, retrieval models, rewriting methods, rerankers, three LLMs, and three prompting techniques across four benchmarks shows that no single configuration optimizes performance on all datasets, owing to differences in knowledge distribution. LLM selection proves critical: Claude 3.5 Sonnet improves correctness by 0.4% and the acceptable ratio on unanswerable queries by 10.4% over GPT-4o. Prompt design also matters, with the best prompts improving unanswerable-query performance by 80%. Overall, three metrics capture a RAG system’s ability to reject unanswerable requests: Acceptable Ratio, Unanswered Ratio, and Joint Score.
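    A component sweep like the one described above can be expressed as a simple grid search ranked by the Joint Score. The sketch below is an assumption-laden illustration: the component lists, the `run_eval` callback, and the harmonic-mean form of the Joint Score are placeholders, not the paper’s exact definitions.

```python
# Hedged sketch of sweeping RAG component combinations and ranking them by a
# Joint Score. The harmonic-mean combination is an assumption for illustration.
from itertools import product

def joint_score(answerable_correctness: float, unanswerable_acceptable: float) -> float:
    """Assumed combination: harmonic mean, rewarding configs that do well on both sides."""
    total = answerable_correctness + unanswerable_acceptable
    return 0.0 if total == 0 else 2 * answerable_correctness * unanswerable_acceptable / total

def sweep_configs(embedders, retrievers, llms, prompts, run_eval):
    """Evaluate every component combination and return them sorted by Joint Score."""
    results = []
    for config in product(embedders, retrievers, llms, prompts):
        correctness, acceptable = run_eval(*config)   # assumed: two floats in [0, 1]
        results.append((config, joint_score(correctness, acceptable)))
    return sorted(results, key=lambda item: item[1], reverse=True)
```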

    UAEval4RAG proves highly effective at generating unanswerable requests, achieving 92% accuracy with strong inter-rater agreement scores of 0.85 and 0.88 on the TriviaQA and Musique datasets, respectively. The LLM-based metrics show robust performance, with high accuracy and F1 scores across three LLMs, validating their reliability regardless of the backbone model used. Comprehensive analysis confirms that no single combination of RAG components excels across all datasets, and that prompt design influences both hallucination control and query rejection. Dataset characteristics matter as well: performance on modality-limited requests correlates with keyword prevalence (18.41% in TriviaQA versus 6.36% in HotpotQA), and handling of safety-concerned requests depends on the number of chunks available per question.

    In conclusion, the researchers introduced UAEval4RAG, a framework for evaluating RAG systems’ ability to handle unanswerable requests, addressing a critical gap in existing evaluation methods that predominantly focus on answerable queries. Future work could benefit from integrating more diverse, human-verified sources to increase generalizability. While the proposed metrics demonstrate strong alignment with human evaluations, tailoring them to specific applications could further enhance their effectiveness. The current evaluation also focuses on single-turn interactions; extending the framework to multi-turn dialogues would better capture real-world scenarios in which systems engage in clarifying exchanges with users to handle underspecified or ambiguous queries.


    Check out the Paper. All credit for this research goes to the researchers of this project.

    Source: MarkTechPost
