Close Menu
    DevStackTipsDevStackTips
    • Home
    • News & Updates
      1. Tech & Work
      2. View All

      Sunshine And March Vibes (2025 Wallpapers Edition)

      June 2, 2025

      The Case For Minimal WordPress Setups: A Contrarian View On Theme Frameworks

      June 2, 2025

      How To Fix Largest Contentful Paint Issues With Subpart Analysis

      June 2, 2025

      How To Prevent WordPress SQL Injection Attacks

      June 2, 2025

      How Red Hat just quietly, radically transformed enterprise server Linux

      June 2, 2025

      OpenAI wants ChatGPT to be your ‘super assistant’ – what that means

      June 2, 2025

      The best Linux VPNs of 2025: Expert tested and reviewed

      June 2, 2025

      One of my favorite gaming PCs is 60% off right now

      June 2, 2025
    • Development
      1. Algorithms & Data Structures
      2. Artificial Intelligence
      3. Back-End Development
      4. Databases
      5. Front-End Development
      6. Libraries & Frameworks
      7. Machine Learning
      8. Security
      9. Software Engineering
      10. Tools & IDEs
      11. Web Design
      12. Web Development
      13. Web Security
      14. Programming Languages
        • PHP
        • JavaScript
      Featured

      `document.currentScript` is more useful than I thought.

      June 2, 2025
      Recent

      `document.currentScript` is more useful than I thought.

      June 2, 2025

      Adobe Sensei and GenAI in Practice for Enterprise CMS

      June 2, 2025

      Over The Air Updates for React Native Apps

      June 2, 2025
    • Operating Systems
      1. Windows
      2. Linux
      3. macOS
      Featured

      You can now open ChatGPT on Windows 11 with Win+C (if you change the Settings)

      June 2, 2025
      Recent

      You can now open ChatGPT on Windows 11 with Win+C (if you change the Settings)

      June 2, 2025

      Microsoft says Copilot can use location to change Outlook’s UI on Android

      June 2, 2025

      TempoMail — Command Line Temporary Email in Linux

      June 2, 2025
    • Learning Resources
      • Books
      • Cheatsheets
      • Tutorials & Guides
    Home»Development»Fact or Fiction? NOCHA: A New Benchmark for Evaluating Long-Context Reasoning in LLMs

    Fact or Fiction? NOCHA: A New Benchmark for Evaluating Long-Context Reasoning in LLMs

    June 28, 2024

    Natural Language Processing (NLP) is a critical area of artificial intelligence that focuses on the interaction between computers and human language. It involves developing algorithms and models that enable computers to comprehend, interpret, and generate human language. This technology finds applications in various domains, such as machine translation, sentiment analysis, and information retrieval.

    What presents a challenge is the evaluation of long-context language models. These models are crucial for tasks that require understanding and generating text based on extensive context. However, they often need help maintaining consistency and accuracy over long passages, leading to potential errors and inefficiencies in applications requiring deep contextual understanding.

    Existing research includes frameworks like “needle-in-a-haystack” (NIAH) for long-context language model evaluation. Models such as GPT-4 and RULER are evaluated using these methods. These frameworks typically involve synthetic tasks generated programmatically or by language models, which can lack real-world complexity. Benchmarks like NIAH and its variants must fully capture the nuances of narrative text, often failing in global reasoning tasks. This synthetic nature of current methods limits their effectiveness in assessing true language comprehension.

    Researchers from UMass Amherst, Allen Institute for AI, and Princeton University have introduced a new evaluation methodology called NOCHA (Narrative Open-Contextualized Human Annotation). This approach is designed to assess the performance of long-context language models more accurately. NOCHA involves collecting minimal narrative pairs, where one claim is true, and the other is false, both written by readers of books.

    The NOCHA methodology involves collecting narrative minimal pairs from recently published fictional books. Annotators familiar with these books generate pairs of true and false claims based on the content. This dataset includes 1,001 pairs derived from 67 books used to evaluate models like GPT-4 and RULER. Each model is prompted with these claims and the entire book content to verify the claims. The process ensures models are tested on realistic, contextually rich scenarios. Data collection and quality control involve multiple annotators and extensive reviews to maintain high accuracy in claim verification.

    The research demonstrated that current long-context language models, including GPT-4 and its variants, achieve varying degrees of accuracy. For example, GPT-4 attained an accuracy of 76.7% on balanced data but only 55.8% when proper context utilization was required. This result indicates a substantial gap between human and model performance, highlighting the need for further advancements.

    The performance of these models was evaluated on various metrics, including their ability to verify claims about book content accurately. Human readers achieved a claim verification accuracy of 96.9%, significantly higher than the best-performing model. This result underscores the models’ struggles with tasks that require global reasoning over extended contexts instead of simple sentence-level retrieval.

    In conclusion, the research identifies significant challenges in evaluating long-context language models and introduces a novel methodology to address these issues. The NOCHA approach offers a more realistic and rigorous framework for testing these models, providing valuable insights into their strengths and limitations. This work emphasizes the importance of developing more sophisticated evaluation techniques to advance the field of NLP.

    Check out the Paper, GitHub, and Leaderboard. All credit for this research goes to the researchers of this project. Also, don’t forget to follow us on Twitter. 

    Join our Telegram Channel and LinkedIn Group.

    If you like our work, you will love our newsletter..

    Don’t Forget to join our 45k+ ML SubReddit

    Create, edit, and augment tabular data with the first compound AI system, Gretel Navigator, now generally available! [Advertisement]

    The post Fact or Fiction? NOCHA: A New Benchmark for Evaluating Long-Context Reasoning in LLMs appeared first on MarkTechPost.

    Source: Read More 

    Facebook Twitter Reddit Email Copy Link
    Previous ArticleWhat if We could Universally Edit Any Two Pieces of DNA? Meet ‘Bridge Editing’ and ‘Bridge RNA’: A Modular Approach to RNA-Guided Genetic Rearrangements in Bacteria
    Next Article Imbue Team Trains 70B-Parameter Model From Scratch: Innovations in Pre-Training, Evaluation, and Infrastructure for Advanced AI Performance

    Related Posts

    Security

    Chrome Zero-Day Alert: CVE-2025-5419 Actively Exploited in the Wild

    June 2, 2025
    Security

    CISA Adds 5 Actively Exploited Vulnerabilities to KEV Catalog: ASUS Routers, Craft CMS, and ConnectWise Targeted

    June 2, 2025
    Leave A Reply Cancel Reply

    Continue Reading

    One of the best Ring cameras I’ve tested is 50% off for a limited time

    News & Updates

    Using JSON Web Tokens with Node.js

    Development

    GitHub Availability Report: December 2024

    News & Updates

    NVIDIA GeForce NOW launches a native Steam Deck app, unlocking 4K cloud gaming & extended battery life for Valve’s handheld

    News & Updates

    Highlights

    Strategic Linear Contextual Bandits

    December 7, 2024

    Motivated by the phenomenon of strategic agents gaming a recommendation system to maximize the number…

    CVE-2024-57783 – “Dot Desktop XSS Command Execution”

    June 2, 2025

    StreamSpeech: an “All in One” seamless model for offline and simultaneous ASR, translation and TTS.

    June 17, 2024

    The latest robot vacuum innovation will have carpet and upholstery lovers drooling

    April 7, 2025
    © DevStackTips 2025. All rights reserved.
    • Contact
    • Privacy Policy

    Type above and press Enter to search. Press Esc to cancel.