
    NVIDIA Researchers Introduce Dynamic Memory Sparsification (DMS) for 8× KV Cache Compression in Transformer LLMs

    June 11, 2025

    As the demand for reasoning-heavy tasks grows, large language models (LLMs) are increasingly expected to generate longer sequences or parallel chains of reasoning. However, inference-time performance is severely limited by the memory footprint of the key–value (KV) cache, not just the number of tokens produced. In a recent paper, researchers from NVIDIA and the University of Edinburgh introduce Dynamic Memory Sparsification (DMS)—a data-efficient, retrofit-friendly method that compresses KV caches and unlocks inference-time hyper-scaling without degrading model accuracy.

    The Bottleneck: KV Cache in Transformer Inference

    Transformer-based models like GPT, LLaMA, and Qwen use KV caches to store past token representations for autoregressive generation. This cache grows linearly with sequence length and width (parallel threads), consuming large amounts of GPU memory and leading to slower inference due to frequent memory access.
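The linear growth is easy to see with a back-of-the-envelope calculation. The sketch below uses the standard KV cache size formula (2 tensors × layers × KV heads × head dimension × sequence length × batch × bytes per element); the concrete shapes are illustrative, loosely modeled on a 7B-class decoder-only model, not taken from the paper.

```python
# Rough KV cache footprint for a decoder-only Transformer.
# The "2" accounts for one cached tensor of keys and one of values.

def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch, dtype_bytes=2):
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * dtype_bytes

# Hypothetical config: 32 layers, 32 KV heads, head_dim 128, 8k tokens, fp16
size = kv_cache_bytes(32, 32, 128, 8192, batch=1)
print(f"{size / 2**30:.1f} GiB")  # prints "4.0 GiB"
```

Doubling the sequence length or the number of parallel decoding threads doubles this footprint, which is why KV cache size, rather than token count alone, dominates inference cost for long or parallel reasoning.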

    Existing techniques for KV cache optimization either rely on training-free heuristics—such as attention weight-based token eviction—or require heavy post-training retrofits like Dynamic Memory Compression (DMC). Both have significant downsides: the former tends to hurt accuracy, while the latter is computationally expensive.

Dynamic Memory Sparsification (DMS): Compression Without Compromise

Dynamic Memory Sparsification (DMS) addresses these limitations with a hybrid approach: it sparsifies the KV cache like traditional pruning methods, but with minimal training overhead (~1,000 steps) and delayed eviction, which retains tokens temporarily after they are marked for removal. This design preserves important context information and avoids abrupt accuracy drops.

    The core idea is to make eviction decisions differentiable during training using a Gumbel-sigmoid-based sampling mechanism. Tokens predicted for future eviction remain usable for a sliding window duration before being discarded, allowing the model to absorb their informational value more effectively.
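The Gumbel-sigmoid trick can be sketched in a few lines. This is a minimal illustration of the general relaxation (not the authors' implementation): logistic noise, i.e. the difference of two Gumbel samples, is added to each token's eviction logit, and the result is squashed through a sigmoid with a temperature, giving a soft, differentiable "evict/keep" score during training that can be thresholded to a hard decision at inference. All logit values below are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def gumbel_sigmoid(logits, temperature=1.0):
    # Logistic noise = difference of two Gumbel samples; adding it and
    # applying a tempered sigmoid yields a relaxed Bernoulli sample in (0, 1).
    u = rng.uniform(1e-9, 1.0 - 1e-9, size=np.shape(logits))
    logistic_noise = np.log(u) - np.log1p(-u)
    return 1.0 / (1.0 + np.exp(-(np.asarray(logits) + logistic_noise) / temperature))

logits = np.array([-4.0, 0.5, 3.0, -1.0])       # per-token eviction logits (hypothetical)
soft = gumbel_sigmoid(logits, temperature=0.5)  # differentiable scores in (0, 1)
hard = (soft > 0.5).astype(int)                 # hard evict/keep decision at inference
```

Under DMS's delayed-eviction scheme, a token whose hard decision is "evict" would still remain attendable for a sliding-window number of steps before actually being dropped from the cache.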

    Efficient Retrofitting with Minimal Data

    Unlike DMC, which requires thousands of training steps and complex gradient-based optimization, DMS introduces no additional parameters per attention head. It reuses a small part of the attention mechanism (a single neuron) to predict eviction. This makes DMS ideal for retrofitting existing models without architectural changes.
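One way to picture the "no extra parameters" retrofit is repurposing a single coordinate of an existing attention projection as the eviction logit, rather than adding a new prediction head. The sketch below is a loose illustration of that idea under assumed names and shapes; it is not the paper's code.

```python
import numpy as np

d_model, d_head = 16, 8
rng = np.random.default_rng(1)
W_k = rng.standard_normal((d_model, d_head))  # existing key projection (hypothetical shapes)

def project_key_with_eviction(x):
    k = x @ W_k                # (seq, d_head) keys, computed exactly as before
    evict_logit = k[:, 0]      # repurpose one existing coordinate as the eviction score
    return k, evict_logit

x = rng.standard_normal((5, d_model))
k, logit = project_key_with_eviction(x)
```

Because no new weights are introduced, a pretrained checkpoint can be fine-tuned briefly (~1K steps, per the paper) to learn useful eviction scores without any architectural change.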

    Empirical results show that with as few as 1K training steps, DMS can achieve 8× KV cache compression, preserving or even improving model performance across reasoning tasks.

    Benchmark Results: Scaling Performance Without Scaling Cost

    The research team tested DMS on reasoning-heavy benchmarks like:

    • AIME 2024 (advanced math)
    • MATH 500 (mathematical problem solving)
    • GPQA Diamond (hard science QA)
    • LiveCodeBench (code generation)

    Across model sizes—Qwen-R1 1.5B, 7B, and 32B—DMS improved exact-match performance by 9.1 points on AIME, 7.6 on GPQA, and 9.6 on LiveCodeBench, all under the same memory and compute budgets.

    When compared to top-performing baselines like Quest and TOVA, DMS consistently outperformed them in both KV cache read efficiency (runtime proxy) and peak memory usage, achieving better Pareto frontiers.

    General-Purpose Utility

DMS also holds up in non-reasoning tasks. On short-context benchmarks like MMLU, GSM8K, and HellaSwag, DMS maintained performance at compression ratios up to 4× with minimal degradation (~3.5 points). On long-context tasks like Needle-in-a-Haystack and Variable Tracking, DMS even surpassed the vanilla models, suggesting its potential to mitigate issues like information over-squashing in long sequences.

    Conclusion

    In conclusion, Dynamic Memory Sparsification (DMS) presents a practical and scalable solution for enhancing the inference-time efficiency of Transformer-based language models. By intelligently compressing the KV cache with minimal retraining, DMS enables models to reason over longer sequences or in parallel without increasing runtime or memory demands. Its consistent gains across a range of reasoning and general-purpose tasks highlight its versatility and effectiveness. As LLMs are increasingly deployed in resource-constrained environments, DMS offers a compelling path forward—balancing compression, accuracy, and ease of integration for real-world inference workloads.


Check out the Paper. All credit for this research goes to the researchers of this project.

    The post NVIDIA Researchers Introduce Dynamic Memory Sparsification (DMS) for 8× KV Cache Compression in Transformer LLMs appeared first on MarkTechPost.
