
    Scaling Language Model Evaluation: From Thousands to Millions of Tokens with BABILong

    December 20, 2024

    Large language models (LLMs) and newer neural architectures have advanced considerably, particularly in their ability to process longer contexts. Longer contexts let a model draw on more of the available information when generating a response, and they strengthen in-context learning by making room for more examples and more complex instructions. Evaluation benchmarks have not kept pace. Assessment tools such as LongBench and L-Eval remain limited to 40,000 tokens, while modern models can process hundreds of thousands or even millions of tokens, leaving a significant gap between model capabilities and evaluation methods.

    Long-context evaluation began with Long Range Arena (LRA), which handled sequences of up to 16,000 tokens but focused primarily on specialized tasks such as ListOps and byte-level operations. This limitation prompted the development of more comprehensive evaluation frameworks. Notable among these are LongBench, Scrolls, and L-Eval, which incorporate diverse tasks ranging from summarization to code completion, with token lengths varying from 3,000 to 60,000. More recent benchmarks, such as LongAlign and LongICLBench, focus specifically on in-context learning and instruction following. Additional datasets like InfinityBench, NovelQA, and ChapterBreak have pushed boundaries further, handling up to 636,000 tokens and covering domains from Wikipedia articles to movie scripts.

    Researchers from AIRI (Moscow, Russia), the Neural Networks and Deep Learning Lab at MIPT (Dolgoprudny, Russia), and the London Institute for Mathematical Sciences (London, UK) introduce BABILong, a benchmark designed to evaluate language models' reasoning capabilities across extremely long documents. The evaluation framework comprises 20 distinct reasoning tasks, including fact chaining, induction, deduction, and list handling, and uses books from the PG19 corpus as source material. The benchmark's flexibility allows for testing sequences of up to 50 million tokens, making it suited to evaluating next-generation models. Initial testing reveals significant limitations in current models, with popular LLMs effectively utilizing only 10-20% of the available context. Retrieval-Augmented Generation (RAG) methods reach only 60% accuracy on single-fact questions, whereas architectural alternatives such as Mamba and Recurrent Memory Transformers perform better, with the Associative Recurrent Memory Transformer (ARMT) notably processing sequences of up to 50 million tokens.

    BABILong evaluates long-context handling by embedding task-relevant sentences within irrelevant text drawn from the PG19 dataset. This creates a challenging environment that mirrors real-world scenarios where crucial information is dispersed throughout lengthy documents. Because more background text can always be added, context length can be scaled without limit, which enables the evaluation of models with context windows of millions of tokens. The benchmark builds on the original bAbI tasks, which assess fundamental reasoning capabilities through simulated interactions between characters and objects. These tasks, labeled QA1 through QA20, test various cognitive abilities including spatial reasoning, temporal understanding, and deduction. Notably, this synthetic approach ensures immunity to training data contamination, a common vulnerability in traditional NLP benchmarks.
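
    The construction described above can be sketched in a few lines of Python. This is a minimal illustration, not the benchmark's actual generation code: the function name `build_sample`, the use of word counts as a stand-in for token counts, the random placement policy, and the filler sentences (placeholders for PG19 book text) are all assumptions made for the example.

```python
import random


def build_sample(facts, question, answer, background_sentences,
                 target_words, seed=0):
    """Scatter task facts, in their original order, among background sentences.

    Hypothetical helper: word counts approximate token counts, and the
    placement policy (uniform random positions) is an assumption.
    """
    rng = random.Random(seed)

    # Take background sentences until the word budget would be exceeded.
    n_words = sum(len(f.split()) for f in facts)
    filler = []
    for sentence in background_sentences:
        size = len(sentence.split())
        if n_words + size > target_words:
            break
        filler.append(sentence)
        n_words += size

    # Draw one insertion point per fact and sort them so that the facts
    # keep their relative order, which matters for reasoning about events.
    positions = sorted(rng.randint(0, len(filler)) for _ in facts)

    # Insert from the last fact backwards so earlier positions stay valid.
    document = list(filler)
    for pos, fact in reversed(list(zip(positions, facts))):
        document.insert(pos, fact)

    return {"input": " ".join(document), "question": question, "answer": answer}


if __name__ == "__main__":
    facts = ["Mary moved to the bathroom.", "John went to the hallway."]
    # Placeholder filler; a real sample would use sentences from PG19 books.
    background = [f"Filler sentence number {i} stands in for book text."
                  for i in range(50)]
    sample = build_sample(facts, "Where is Mary?", "bathroom",
                          background, target_words=120, seed=1)
    print(sample["input"])
    print(sample["question"], "->", sample["answer"])
```

    Raising `target_words` lengthens the document without changing the facts or the question, which is the property that lets context length be scaled independently of task difficulty.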

    An analysis of context utilization reveals significant limitations in language models' ability to process long sequences. Testing across various question-answering tasks shows that most current LLMs effectively use only 10-20% of their advertised context window. Among 34 tested models, only 23 achieved the benchmark threshold of 85% accuracy on basic tasks without distractor text. Performance varies significantly across architectures: while models like GPT-4 and Llama-3.1-70b remain effective up to 16K tokens, most models struggle beyond 4K tokens. Recent developments show promising improvements, with Qwen-2.5 models leading among open LLMs. The evaluation also explored alternative approaches, including Retrieval-Augmented Generation (RAG) and fine-tuned models. While RAG shows limited success, fine-tuned recurrent memory models, particularly ARMT, show remarkable capabilities, processing sequences of up to 50 million tokens with consistent performance.
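
    To make the "10-20% of the advertised window" figure concrete, the sketch below computes an effective context length from per-length accuracies. It is only an illustration under stated assumptions: the function names are hypothetical, applying the 85% threshold at every context length is an assumption about how "effective" is judged, and the accuracy values are invented for the example rather than taken from any evaluated model.

```python
def effective_context(accuracy_by_length, threshold=0.85):
    """Return the largest tested length up to which accuracy stays >= threshold.

    Hypothetical definition: the scan stops at the first length that
    falls below the threshold, even if a longer one recovers.
    """
    effective = 0
    for length in sorted(accuracy_by_length):
        if accuracy_by_length[length] < threshold:
            break
        effective = length
    return effective


def utilization(accuracy_by_length, advertised_window, threshold=0.85):
    """Fraction of the advertised window that is used effectively."""
    return effective_context(accuracy_by_length, threshold) / advertised_window


if __name__ == "__main__":
    # Invented accuracies for an imaginary model with a 128K-token window.
    accuracies = {0: 0.98, 4_000: 0.93, 16_000: 0.87,
                  32_000: 0.71, 64_000: 0.52, 128_000: 0.35}
    print(effective_context(accuracies))     # 16000
    print(utilization(accuracies, 128_000))  # 0.125
```

    With these made-up numbers the model stays above the threshold only up to 16,000 tokens, so it uses 12.5% of its 128,000-token window effectively.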

    BABILong advances the evaluation of long-context capabilities by combining scalability with diverse reasoning tasks. Its adaptable design covers sequences from 0 to 10 million tokens, and can be extended further, while maintaining algorithmic control over document length and fact placement. Testing revealed that current models, including advanced systems like GPT-4 and Gemini 1.5 Pro, utilize only 5-25% of their input context effectively. Newer models like Llama-3.1 and Qwen-2.5 perform better but still face limitations. Fine-tuning experiments proved particularly revealing, showing that even relatively small models like RMT and ARMT (137M parameters) can effectively handle BABILong tasks, with ARMT notably processing sequences of up to 50 million tokens, far surpassing Mamba's practical limit of 128K tokens.


    Check out the Paper. All credit for this research goes to the researchers of this project.

    The post Scaling Language Model Evaluation: From Thousands to Millions of Tokens with BABILong appeared first on MarkTechPost.
