NVIDIA Researchers Introduce Dynamic Memory Sparsification (DMS) for 8× KV Cache Compression in Transformer LLMs

As the demand for reasoning-heavy tasks grows, large language models (LLMs) are increasingly expected to generate longer sequences or parallel chains of reasoning. However, inference-time performance is severely limited by the memory footprint of the key–value (KV) cache, not just the number of tokens produced. In a recent paper, researchers from NVIDIA and the University of Edinburgh introduce Dynamic Memory Sparsification (DMS)—a data-efficient, retrofit-friendly method that compresses KV caches and unlocks inference-time hyper-scaling without degrading model accuracy.

The Bottleneck: KV Cache in Transformer Inference

Transformer-based models like GPT, LLaMA, and Qwen use KV caches to store past token representations for autoregressive generation. This cache grows linearly with sequence length and width (parallel threads), consuming large amounts of GPU memory and leading to slower inference due to frequent memory access.
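The linear growth described above can be made concrete with a back-of-the-envelope calculation. The sketch below is only an illustration: the function name, the model shape, and the 2-byte element size are assumptions chosen for the example and are not taken from the paper.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   width=1, bytes_per_elem=2):
    """Bytes needed to cache keys and values for `width` parallel sequences."""
    # The leading factor 2 accounts for storing both a key and a value tensor.
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * width * bytes_per_elem)


# Hypothetical model shape, picked only so the numbers come out round.
cfg = dict(num_layers=32, num_kv_heads=8, head_dim=128)

base = kv_cache_bytes(seq_len=8192, width=1, **cfg)     # one 8K-token chain
longer = kv_cache_bytes(seq_len=16384, width=1, **cfg)  # twice the length
wider = kv_cache_bytes(seq_len=8192, width=4, **cfg)    # four parallel chains

for name, size in [("base", base), ("longer", longer), ("wider", wider)]:
    print(f"{name}: {size / 2**30:.1f} GiB")
```

Doubling the sequence length doubles the cache, and running four chains in parallel quadruples it, which is why both longer and wider generation run into the same memory wall.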

Existing techniques for KV cache optimization either rely on training-free heuristics—such as attention weight-based token eviction—or require heavy post-training retrofits like Dynamic Memory Compression (DMC). Both have significant downsides: the former tends to hurt accuracy, while the latter is computationally expensive.

Dynamic Memory Sparsification (DMS): Compression Without Compromise

Dynamic Memory Sparsification (DMS) addresses these limitations with a hybrid approach: it sparsifies the KV cache like traditional pruning methods, but does so with minimal training overhead (~1,000 steps) and with delayed eviction, which keeps tokens in the cache temporarily after they are marked for removal. This design preserves important context information and avoids abrupt accuracy drops.
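To illustrate delayed eviction, here is a minimal simulation under simplifying assumptions: each token gets a yes/no eviction mark when it is written, and a marked token is dropped once `window` newer tokens have arrived. The function name, the boolean flags, and the exact window semantics are hypothetical and are not the authors' implementation.

```python
from collections import deque


def simulate_delayed_eviction(evict_flags, window):
    """Return the cached token positions after each decoding step.

    evict_flags[t] is True when token t is marked for eviction as it is
    written. A marked token stays readable until `window` newer tokens
    have arrived; unmarked tokens are kept for the whole sequence.
    """
    kept = []          # tokens never marked for eviction
    pending = deque()  # marked tokens still inside their grace period
    history = []
    for t, marked in enumerate(evict_flags):
        # Drop marked tokens whose grace period has expired.
        while pending and t - pending[0] >= window:
            pending.popleft()
        if marked:
            pending.append(t)
        else:
            kept.append(t)
        history.append(sorted(kept + list(pending)))
    return history


flags = [False, True, True, False, True, False]
history = simulate_delayed_eviction(flags, window=2)
for step, cache in enumerate(history):
    print(step, cache)
```

In this toy run, token 1 is marked at step 1 but remains in the cache at step 2 and only disappears at step 3, so later tokens still get a chance to attend to it before it is removed.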

The core idea is to make eviction decisions differentiable during training using a Gumbel-sigmoid-based sampling mechanism. Tokens predicted for future eviction remain usable for a sliding window duration before being discarded, allowing the model to absorb their informational value more effectively.

Efficient Retrofitting with Minimal Data

Unlike DMC, which requires thousands of training steps and complex gradient-based optimization, DMS introduces no additional parameters per attention head. It reuses a small part of the attention mechanism (a single neuron) to predict eviction. This makes DMS ideal for retrofitting existing models without architectural changes.
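One way to picture the "single neuron" idea is shown below. This is a sketch under explicit assumptions: it treats the first coordinate of a head's query vector as the eviction logit and zeroes that coordinate before attention. Which neuron is borrowed and how it is masked are illustrative choices here, not details confirmed by the text above, and the function names are hypothetical.

```python
import math


def split_eviction_neuron(q_head):
    """Read q_head[0] as the eviction logit; zero it for attention (assumed)."""
    logit = q_head[0]
    q_for_attention = [0.0] + list(q_head[1:])
    return logit, q_for_attention


def eviction_probability(logit):
    """Numerically stable sigmoid mapping a logit to a probability."""
    if logit >= 0:
        return 1.0 / (1.0 + math.exp(-logit))
    z = math.exp(logit)
    return z / (1.0 + z)


q = [-4.0, 0.3, -1.2, 0.8]  # made-up query vector for one head
logit, q_attn = split_eviction_neuron(q)
p_evict = eviction_probability(logit)
print(logit, q_attn, round(p_evict, 3))
```

Because the logit is read from an existing activation, the vector keeps its shape and no new weights are introduced, which is the property that makes retrofitting cheap.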

Empirical results show that with as few as 1K training steps, DMS can achieve 8× KV cache compression, preserving or even improving model performance across reasoning tasks.
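What an 8× ratio buys under a fixed memory budget can be estimated with simple arithmetic. The helper below is an idealized sketch: it assumes the compression ratio applies uniformly and ignores the tokens temporarily held by delayed eviction. The function name and the 4,096-slot budget are assumptions for the example.

```python
def max_tokens_per_chain(slot_budget, compression_ratio, width=1):
    """Tokens each chain can generate when the cache has `slot_budget` slots.

    With compression ratio r, roughly one KV entry is kept per r tokens,
    so `slot_budget` slots cover about slot_budget * r tokens in total.
    """
    return int(slot_budget * compression_ratio) // width


budget = 4096  # hypothetical number of KV entries that fit in memory

vanilla = max_tokens_per_chain(budget, compression_ratio=1)
longer = max_tokens_per_chain(budget, compression_ratio=8)
wider = max_tokens_per_chain(budget, compression_ratio=8, width=8)

print(vanilla, longer, wider)
```

Under these assumptions the same budget supports either one chain that is eight times longer or eight parallel chains of the original length.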

Benchmark Results: Scaling Performance Without Scaling Cost

The research team tested DMS on reasoning-heavy benchmarks like:

AIME 2024 (advanced math)
MATH 500 (mathematical problem solving)
GPQA Diamond (hard science QA)
LiveCodeBench (code generation)

Across model sizes—Qwen-R1 1.5B, 7B, and 32B—DMS improved exact-match performance by 9.1 points on AIME, 7.6 on GPQA, and 9.6 on LiveCodeBench, all under the same memory and compute budgets.

When compared to top-performing baselines like Quest and TOVA, DMS consistently outperformed them in both KV cache read efficiency (runtime proxy) and peak memory usage, achieving better Pareto frontiers.
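For readers unfamiliar with the term, a Pareto frontier here is the set of configurations for which no other configuration is both cheaper and more accurate. The sketch below computes one from (cost, score) pairs. The data points are invented purely for illustration and are not results from the paper.

```python
def pareto_frontier(points):
    """Return (cost, score) points not dominated by any other point.

    Lower cost is better and higher score is better.
    """
    frontier = []
    best_score = float("-inf")
    # Sort by cost, breaking ties by putting the higher score first.
    for cost, score in sorted(points, key=lambda p: (p[0], -p[1])):
        if score > best_score:
            frontier.append((cost, score))
            best_score = score
    return frontier


# Made-up (KV cache reads, accuracy) pairs for one hypothetical method.
points = [(1.0, 40.0), (2.0, 55.0), (2.0, 50.0), (3.0, 52.0), (4.0, 60.0)]
frontier = pareto_frontier(points)
print(frontier)
```

A method has a better frontier than another when, at any given cost, its frontier reaches an equal or higher score.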

General-Purpose Utility

DMS also holds up in non-reasoning tasks. On short-context benchmarks like MMLU, GSM8K, and HellaSwag, DMS maintained performance at compression ratios up to 4× with minimal degradation (~3.5 points). On long-context tasks like Needle-in-a-Haystack and Variable Tracking, DMS even surpassed the vanilla models, suggesting its potential to mitigate issues like information over-squashing in long sequences.

Conclusion

Dynamic Memory Sparsification (DMS) presents a practical and scalable way to improve the inference-time efficiency of Transformer-based language models. By compressing the KV cache with minimal retraining, DMS lets models reason over longer sequences or more parallel chains without increasing runtime or memory demands. Its consistent gains across reasoning and general-purpose tasks highlight its versatility. As LLMs are increasingly deployed in resource-constrained environments, DMS offers a compelling path forward that balances compression, accuracy, and ease of integration for real-world inference workloads.

Check out the Paper. All credit for this research goes to the researchers of this project.

The post NVIDIA Researchers Introduce Dynamic Memory Sparsification (DMS) for 8× KV Cache Compression in Transformer LLMs appeared first on MarkTechPost.
