Scaling Language Model Evaluation: From Thousands to Millions of Tokens with BABILong

Large Language Models (LLMs) and the neural architectures behind them have advanced considerably, particularly in how much context they can process. Longer context lets a model draw on more of the available information, which yields more accurate and contextually relevant responses. It also strengthens in-context learning, since the model can use more examples and follow more complex instructions. Evaluation benchmarks have not kept pace with these gains. Widely used benchmarks such as LongBench and L-Eval remain limited to 40,000 tokens, while modern models can process hundreds of thousands or even millions of tokens, leaving a significant gap between model capabilities and evaluation methods.

Long-context evaluation began with Long Range Arena (LRA), which handled sequences of up to 16,000 tokens but focused primarily on specialized tasks such as ListOps and byte-level operations. This limitation prompted the development of more comprehensive evaluation frameworks. Notable among these are LongBench, Scrolls, and L-Eval, which incorporate diverse tasks ranging from summarization to code completion, with lengths varying from 3,000 to 60,000 tokens. More recent benchmarks, such as LongAlign and LongICLBench, focus specifically on in-context learning and instruction following. Additional datasets like InfinityBench, NovelQA, and ChapterBreak have pushed the boundary further, handling up to 636,000 tokens and covering domains from Wikipedia articles to movie scripts.

Researchers from AIRI (Moscow, Russia); the Neural Networks and Deep Learning Lab, MIPT (Dolgoprudny, Russia); and the London Institute for Mathematical Sciences (London, UK) introduce BABILong, a benchmark designed to evaluate language models’ reasoning capabilities across extremely long documents. It comprises 20 distinct reasoning tasks, including fact chaining, induction, deduction, and list handling, and uses books from the PG19 corpus as source material. The benchmark’s flexibility allows for testing sequences of up to 50 million tokens, which makes it suitable for evaluating next-generation models. Initial testing reveals significant limitations in current models: popular LLMs effectively use only 10-20% of their available context. Retrieval-Augmented Generation methods achieve 60% accuracy on single-fact questions, while architectural alternatives such as Mamba and Recurrent Memory Transformers perform better, with ARMT notably processing sequences of up to 50 million tokens.

BABILong evaluates long-context handling by embedding task-relevant sentences within irrelevant text drawn from the PG19 dataset. The result is a challenging setting that mirrors real-world scenarios in which crucial information is dispersed throughout a lengthy document. Because any amount of background text can be added, context length can be scaled without limit, which enables the evaluation of models with context windows of millions of tokens. The benchmark builds upon the original bAbI tasks, which assess fundamental reasoning capabilities through simulated interactions between characters and objects. These tasks, labeled QA1 through QA20, test various cognitive abilities, including spatial reasoning, temporal understanding, and deduction. Because the samples are generated synthetically, the benchmark is also immune to training data contamination, a common vulnerability in traditional NLP benchmarks.
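The construction described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' generator: the function name `build_sample`, its parameters, and the use of whitespace-separated words as a stand-in for tokens are all hypothetical. Facts keep their original order, since bAbI questions often depend on the sequence of events.

```python
import random


def build_sample(facts, question, filler_sentences, target_words, seed=0):
    """Scatter task facts (order preserved) among unrelated filler sentences.

    Hypothetical helper: length is measured in whitespace-separated words,
    which is only a rough proxy for model tokens.
    """
    rng = random.Random(seed)

    # Keep adding filler sentences until the word budget would be exceeded.
    budget = target_words - sum(len(fact.split()) for fact in facts)
    background, used = [], 0
    for sentence in filler_sentences:
        n_words = len(sentence.split())
        if used + n_words > budget:
            break
        background.append(sentence)
        used += n_words

    # Draw one insertion slot per fact; sorting keeps the facts in order.
    slots = sorted(rng.randint(0, len(background)) for _ in facts)

    merged, next_fact = [], 0
    for i in range(len(background) + 1):
        # Emit every fact assigned to slot i before the i-th filler sentence.
        while next_fact < len(facts) and slots[next_fact] == i:
            merged.append(facts[next_fact])
            next_fact += 1
        if i < len(background):
            merged.append(background[i])

    return {"context": " ".join(merged), "question": question}


# Illustrative inputs only; real filler would come from PG19 books.
facts = [
    "Mary went to the kitchen.",
    "Mary picked up the apple.",
    "Mary went to the garden.",
]
filler = ["The rain fell on the old town."] * 50
sample = build_sample(facts, "Where is the apple?", filler, target_words=100)
print(len(sample["context"].split()), "words of context")
```

Raising `target_words` lengthens the background without changing the facts or the question, which is what lets the same task be posed at any context length.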

An analysis of how language models use their context reveals significant limitations in processing long sequences. Testing across various question-answering tasks shows that most current LLMs effectively use only 10-20% of their advertised context window. Among 34 tested models, only 23 reached the benchmark threshold of 85% accuracy on basic tasks without distractor text. Performance varies significantly across architectures: models like GPT-4 and Llama-3.1-70b remain effective up to 16K tokens, while most models struggle beyond 4K tokens. Recent releases show promising improvements, with Qwen-2.5 models leading among open LLMs. The evaluation also explored alternative approaches, including Retrieval-Augmented Generation (RAG) and fine-tuned models. RAG shows limited success, whereas fine-tuned recurrent memory models, particularly ARMT, process sequences of up to 50 million tokens with consistent performance.
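One way to turn per-length accuracy into an "effective context" figure is sketched below. The definition used here is an assumption made for illustration: the longest tested length at which accuracy has stayed at or above the 85% threshold for every shorter tested length. The function name `effective_context` and all scores are hypothetical and are not results from the paper.

```python
def effective_context(accuracy_by_length, threshold=0.85):
    """Return the longest tested length before accuracy first drops below threshold.

    Assumed definition for illustration; returns 0 if even the shortest
    tested length misses the threshold.
    """
    effective = 0
    for length in sorted(accuracy_by_length):
        if accuracy_by_length[length] < threshold:
            break
        effective = length
    return effective


# Hypothetical scores for one model, keyed by context length in tokens.
scores = {0: 0.98, 4_000: 0.93, 16_000: 0.86, 32_000: 0.71, 128_000: 0.40}
advertised_window = 128_000

usable = effective_context(scores)
print(usable, usable / advertised_window)  # prints: 16000 0.125
```

In this made-up case the model would be using 12.5% of its advertised window, which falls inside the 10-20% range reported above.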

BABILong advances the evaluation of long-context capabilities by combining scalability with diverse reasoning tasks. Its adaptable design allows for testing sequences from 0 to 10 million tokens while maintaining algorithmic control over document length and fact placement. Testing revealed that current models, including advanced systems like GPT-4 and Gemini 1.5 Pro, use only 5-25% of their input context effectively. Newer models like Llama-3.1 and Qwen-2.5 perform better but still face limitations. Fine-tuning experiments proved particularly revealing: even relatively small models like RMT and ARMT (137M parameters) can handle BABILong tasks effectively, with ARMT notably processing sequences of up to 50 million tokens, far surpassing Mamba’s practical limit of 128K tokens.

The post Scaling Language Model Evaluation: From Thousands to Millions of Tokens with BABILong appeared first on MarkTechPost.
