    Fact or Fiction? NOCHA: A New Benchmark for Evaluating Long-Context Reasoning in LLMs

    June 28, 2024

    Natural Language Processing (NLP) is a critical area of artificial intelligence that focuses on the interaction between computers and human language. It involves developing algorithms and models that enable computers to comprehend, interpret, and generate human language. This technology finds applications in various domains, such as machine translation, sentiment analysis, and information retrieval.

    Evaluating long-context language models remains a challenge. These models are crucial for tasks that require understanding and generating text based on extensive context. However, they often struggle to maintain consistency and accuracy over long passages, which leads to errors and inefficiencies in applications that require deep contextual understanding.

    Existing research includes frameworks like “needle-in-a-haystack” (NIAH) for long-context language model evaluation. Models such as GPT-4 are evaluated with NIAH and with related synthetic benchmarks such as RULER. These frameworks typically involve synthetic tasks generated programmatically or by language models, which can lack real-world complexity. Benchmarks like NIAH and its variants do not fully capture the nuances of narrative text and often fall short on global reasoning tasks. This synthetic nature limits their effectiveness in assessing true language comprehension.
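
To make the idea of a programmatically generated synthetic task concrete, here is a minimal sketch of a needle-in-a-haystack style prompt builder. The function name `build_niah_prompt`, the filler sentences, and the "secret code" needle are all hypothetical and chosen for illustration; real NIAH setups differ in their filler text, needle wording, and scoring.

```python
def build_niah_prompt(filler_sentences, needle, depth, question):
    """Insert `needle` at a relative `depth` of the filler text.

    depth = 0.0 places the needle at the start, 1.0 at the end.
    All names here are hypothetical; this is an illustration only.
    """
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be between 0.0 and 1.0")
    # Convert the relative depth into a sentence index.
    position = round(depth * len(filler_sentences))
    haystack = filler_sentences[:position] + [needle] + filler_sentences[position:]
    return " ".join(haystack) + "\n\nQuestion: " + question


# Toy haystack: ten filler sentences with one planted fact in the middle.
filler = [f"Filler sentence number {i}." for i in range(10)]
needle = "The secret code is 7421."
prompt = build_niah_prompt(filler, needle, depth=0.5,
                           question="What is the secret code?")
```

Answering such a prompt only requires locating one sentence, which is why this kind of task says little about reasoning over a whole narrative.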

    Researchers from UMass Amherst, Allen Institute for AI, and Princeton University have introduced a new evaluation benchmark called NOCHA. It is designed to assess the performance of long-context language models more accurately. NOCHA is built from narrative minimal pairs: two claims about a book, one true and one false, both written by readers of that book.

    The NOCHA methodology involves collecting narrative minimal pairs from recently published fictional books. Annotators who have read these books write pairs of true and false claims based on the content. The dataset includes 1,001 pairs derived from 67 books and is used to evaluate long-context models such as GPT-4. Each model is prompted with the claims and the entire book content and asked to verify them. The process ensures models are tested on realistic, contextually rich scenarios. Data collection and quality control involve multiple annotators and extensive reviews to maintain high accuracy in claim verification.
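
The setup described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' code: the `ClaimPair` type, the prompt wording, the choice to verify each claim in its own prompt, and the example claims are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClaimPair:
    """One minimal pair: a true claim and a false claim about the same book."""
    book_title: str
    true_claim: str
    false_claim: str


def build_verification_prompt(book_text: str, claim: str) -> str:
    # Hypothetical prompt wording; the benchmark's real prompt may differ.
    return (
        "You are given the full text of a book and a claim about it.\n"
        "Answer TRUE or FALSE.\n\n"
        f"<book>\n{book_text}\n</book>\n\n"
        f"Claim: {claim}\n"
        "Answer:"
    )


def prompts_for_pair(book_text: str, pair: ClaimPair):
    """Return (prompt, expected_label) for both claims in a pair.

    Assumption: each claim is checked in a separate prompt.
    """
    return [
        (build_verification_prompt(book_text, pair.true_claim), True),
        (build_verification_prompt(book_text, pair.false_claim), False),
    ]


# Made-up example data, for illustration only.
pair = ClaimPair(
    book_title="An Invented Novel",
    true_claim="The narrator leaves the city before the trial begins.",
    false_claim="The narrator leaves the city after the trial ends.",
)
prompts = prompts_for_pair("Chapter 1 ... (entire book text goes here)", pair)
```

Because the two claims differ only in a small detail, a model cannot score well on both by guessing from surface plausibility; it has to use the book text.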

    The research demonstrated that current long-context language models, including GPT-4 and its variants, achieve varying degrees of accuracy. For example, GPT-4o, the best-performing model, attained an accuracy of 76.7% on balanced data but only 55.8% under the stricter pair-level measure, where a model must label both the true and the false claim in a pair correctly. This result indicates a substantial gap between human and model performance, highlighting the need for further advancements.
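
A short sketch can show why a per-claim score and a stricter per-pair score diverge. It assumes a pair counts as correct only when both of its claims are labeled correctly; that scoring rule, the function names, and the toy results below are assumptions for illustration and are not the benchmark's data.

```python
def claim_accuracy(results):
    """Fraction of individual claims labeled correctly.

    `results` is a list of (true_claim_correct, false_claim_correct) booleans.
    """
    correct = sum(int(t) + int(f) for t, f in results)
    return correct / (2 * len(results))


def pair_accuracy(results):
    """Fraction of pairs where BOTH claims were labeled correctly."""
    return sum(1 for t, f in results if t and f) / len(results)


# Toy results for four pairs (not real benchmark output).
results = [(True, True), (True, False), (False, True), (True, True)]

print(claim_accuracy(results))  # 0.75
print(pair_accuracy(results))   # 0.5
```

A model that tends to answer "true" for any plausible-sounding claim gets half of every pair right, which inflates the per-claim score while the per-pair score stays low.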

    The performance of these models was evaluated on various metrics, including their ability to verify claims about book content accurately. Human readers achieved a claim verification accuracy of 96.9%, significantly higher than the best-performing model. This result underscores the models’ struggles with tasks that require global reasoning over extended contexts instead of simple sentence-level retrieval.

    In conclusion, the research identifies significant challenges in evaluating long-context language models and introduces a novel methodology to address these issues. The NOCHA approach offers a more realistic and rigorous framework for testing these models, providing valuable insights into their strengths and limitations. This work emphasizes the importance of developing more sophisticated evaluation techniques to advance the field of NLP.

    The post Fact or Fiction? NOCHA: A New Benchmark for Evaluating Long-Context Reasoning in LLMs appeared first on MarkTechPost.
