
    Google’s Infini-attention gives LLMs “infinite” context

    April 15, 2024

    Google researchers have developed a technique called Infini-attention, which lets LLMs handle infinitely long text while keeping compute and memory requirements bounded.

    The Transformer architecture of an LLM is what allows it to give attention to all of the tokens in a prompt. The dot-product attention and matrix multiplications it performs scale quadratically with the number of tokens.

    This means that doubling the tokens in a prompt requires four times the memory and processing power, which is why it is so challenging to build LLMs with large context windows without memory and compute requirements skyrocketing.
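
    To make that scaling concrete, here is a minimal NumPy sketch (illustrative only, not how any production LLM is implemented) that materializes the n-by-n attention score matrix for a single head and measures its memory footprint; doubling the sequence length quadruples it.

        import numpy as np

        def attention_score_bytes(n_tokens, d_head=64, dtype=np.float32):
            # Standard attention builds an (n_tokens x n_tokens) score matrix:
            # scores = Q @ K.T / sqrt(d_head), followed by a softmax over each row.
            q = np.random.randn(n_tokens, d_head).astype(dtype)
            k = np.random.randn(n_tokens, d_head).astype(dtype)
            scores = (q @ k.T) / np.sqrt(d_head)
            return scores.nbytes

        for n in (1_000, 2_000, 4_000):
            print(n, attention_score_bytes(n) / 1e6, "MB")  # 4 MB, 16 MB, 64 MB per head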

    In a “standard” LLM, information at the beginning of the prompt content is lost once the prompt becomes larger than the context window. Google’s research paper explains how Infini-attention can retain data beyond the context window.

    "Google presents Leave No Context Behind: Efficient Infinite Context Transformers with Infini-attention. 1B model that was fine-tuned on up to 5K sequence length passkey instances solves the 1M length problem." https://t.co/zyHMt3inhi

    — Aran Komatsuzaki (@arankomatsuzaki) April 11, 2024

    How does Infini-attention work?

    Infini-attention combines compressive memory techniques with modified attention mechanisms so that relevant older information isn’t lost.

    Once the input prompt grows beyond the context length of the model, the compressive memory stores information in a compressed format rather than discarding it.

    This allows for older, less immediately relevant information to be stored without memory and compute requirements growing indefinitely as the input grows.

    Instead of trying to retain all the older input information, Infini-attention’s compressive memory weighs and summarizes information that is deemed relevant and worth retaining.

    Infini-attention then takes a “vanilla” attention mechanism but reuses the key-value (KV) states from previous segments rather than discarding them as each new segment is processed.
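
    As a rough sketch of the idea (with hypothetical names and shapes, not the authors' code), the compressive memory can be pictured as a fixed-size matrix built from past keys and values: each segment folds its KV states into the matrix, and later queries read from it at a cost that does not grow with how much history has been stored. The ELU+1 feature map below follows the linear-attention style the paper builds on, but treat the details as assumptions.

        import numpy as np

        def feature_map(x):
            # Non-negative feature map (ELU + 1), an assumption borrowed from linear attention.
            return np.where(x > 0, x + 1.0, np.exp(x))

        class CompressiveMemory:
            """Fixed-size store that accumulates past KV states instead of discarding them."""
            def __init__(self, d_key, d_value):
                self.M = np.zeros((d_key, d_value))   # compressed key-value associations
                self.z = np.zeros((d_key, 1))         # normalization term

            def update(self, K, V):
                # Fold the current segment's keys/values into the memory (constant size).
                sK = feature_map(K)                   # (seg_len, d_key)
                self.M += sK.T @ V                    # (d_key, d_value)
                self.z += sK.sum(axis=0, keepdims=True).T

            def retrieve(self, Q):
                # Long-term read: cost depends only on d_key/d_value, not on history length.
                sQ = feature_map(Q)                   # (seg_len, d_key)
                return (sQ @ self.M) / (sQ @ self.z + 1e-6)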

    Here’s a diagram that shows the difference between Infini-attention and another extended-context model, Transformer-XL.

    Infini-Transformer (top) retains the entire context history, whereas Transformer-XL (bottom) discards old contexts because it caches the KV states for the last segment only. Source: arXiv

    The result is an LLM that gives local attention to recent input data but also carries continuously distilled compressed historical data to which it can apply long-term attention.

    The paper notes that “This subtle but critical modification to the attention layer enables LLMs to process infinitely long contexts with bounded memory and computation resources.”
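
    Putting the pieces together, a hedged per-segment loop might look like the sketch below: ordinary softmax attention within the current segment, a read from the compressive memory for long-term context, and a learned gate that blends the two before the segment is folded into memory. It reuses the CompressiveMemory class from the earlier sketch; the scalar gate and the shared Q/K/V in the usage example are simplifications for illustration, not the paper's exact implementation.

        import numpy as np

        def softmax(x, axis=-1):
            e = np.exp(x - x.max(axis=axis, keepdims=True))
            return e / e.sum(axis=axis, keepdims=True)

        def infini_attention_segment(Q, K, V, memory, gate_logit=0.0):
            d = Q.shape[-1]
            local = softmax(Q @ K.T / np.sqrt(d)) @ V     # attention within this segment only
            long_term = memory.retrieve(Q)                # distilled history from compressive memory
            g = 1.0 / (1.0 + np.exp(-gate_logit))         # learned scalar gate (assumed per head)
            out = g * long_term + (1.0 - g) * local       # blend long-term and local context
            memory.update(K, V)                           # then fold this segment into memory
            return out

        # Hypothetical usage over a long input split into segments:
        memory = CompressiveMemory(d_key=64, d_value=64)
        for segment in np.split(np.random.randn(8 * 128, 64), 8):
            Q = K = V = segment                           # real models use separate projections
            out = infini_attention_segment(Q, K, V, memory)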

    How good is it?

    Google ran benchmarking tests using smaller 1B and 8B parameter Infini-attention models. These were compared against other extended context models like Transformer-XL and Memorizing Transformers.

    The Infini-Transformer achieved significantly lower perplexity scores than the other models when processing long-context content. A lower perplexity score means the model is more certain of its output predictions.
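
    For reference, perplexity is the exponential of the average per-token negative log-likelihood, so a lower score means the model assigned higher probability to the text it actually saw. A minimal illustration with hypothetical probabilities:

        import numpy as np

        def perplexity(token_probs):
            # token_probs: probability the model assigned to each actual next token
            return float(np.exp(-np.mean(np.log(token_probs))))

        print(perplexity([0.5, 0.25, 0.5]))   # ~2.52
        print(perplexity([0.9, 0.8, 0.9]))    # ~1.16 -- more confident, lower perplexity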

    In the “passkey retrieval” tests, the Infini-attention models consistently found the random number hidden in text of up to 1M tokens.

    Other models often manage to retrieve the passkey towards the end of the input but struggle to find it in the middle or beginning of long content. Infini-attention had no trouble with this test.
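
    The passkey test is easy to reproduce in spirit: hide a random number at a chosen depth inside long filler text and ask the model to repeat it back. A hypothetical prompt builder (the filler sentence and word counts are assumptions):

        import random

        FILLER = "The grass is green. The sky is blue. The sun is yellow. Here we go. "

        def passkey_prompt(total_words=100_000, depth=0.5):
            # depth=0.0 hides the key at the start, 0.5 in the middle, 1.0 near the end.
            passkey = random.randint(10_000, 99_999)
            needle = f"The pass key is {passkey}. Remember it."
            words = (FILLER * (total_words // 15 + 1)).split()[:total_words]
            insert_at = int(len(words) * depth)
            prompt = " ".join(words[:insert_at] + [needle] + words[insert_at:])
            return prompt + " What is the pass key?", passkey

        prompt, key = passkey_prompt(total_words=5_000, depth=0.1)  # key buried near the beginning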

    The benchmarking tests are very technical, but the short story is that Infini-attention outperformed the baseline models in summarizing and handling long sequences while maintaining context over extended periods.

    Significantly, it did so while requiring 114x less memory.

    The benchmark results convince the researchers that Infini-attention could be scaled to handle extremely long input sequences while keeping memory and computational resources bounded.

    The plug-and-play nature of Infini-attention means it could be used for continual pre-training and fine-tuning of existing Transformer models. This could effectively extend their context windows without requiring complete retraining of the model.

    Context windows will keep growing, but this approach shows that an efficient memory could be a better solution than a large library.

    The post Google’s Infini-attention gives LLMs “infinite” context appeared first on DailyAI.
