
    KVSharer: A Plug-and-Play Machine Learning Method that Shares the KV Cache between Layers to Achieve Layer-Wise Compression

    November 2, 2024

Large language models (LLMs) built on the Transformer architecture have shown remarkable abilities across a wide range of tasks. These capabilities, however, usually come with a significant increase in model size, which translates into substantial GPU memory costs during inference. The KV cache is a standard technique in LLM inference: it stores the keys and values already computed in the attention mechanism so they can be reused at later decoding steps instead of being recomputed, making generation faster overall. The cache itself becomes a major memory consumer, with the stored keys and values accounting for over 80% of total memory usage, which wastes system resources and drives up the demand for computational capacity. Most existing KV cache compression methods focus on intra-layer compression within a single Transformer layer, while few works consider layer-wise compression.
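To make the role of the KV cache concrete, here is a minimal, self-contained sketch of single-head autoregressive decoding with a cache. The dimensions, weight names, and `decode_step` helper are illustrative placeholders of ours, not anything from the paper; the point is only that keys and values from earlier tokens are stored once and reused at every later step.

```python
import torch

# Toy single-head attention with a KV cache: keys/values of earlier tokens
# are stored once and reused, so each step only projects the newest token.
d_model = 64
W_q = torch.randn(d_model, d_model)
W_k = torch.randn(d_model, d_model)
W_v = torch.randn(d_model, d_model)

k_cache, v_cache = [], []  # grows by one entry per generated token


def decode_step(x_new):
    """x_new: (1, d_model) embedding of the newest token."""
    q = x_new @ W_q
    k_cache.append(x_new @ W_k)      # cache the new key ...
    v_cache.append(x_new @ W_v)      # ... and the new value
    K = torch.cat(k_cache, dim=0)    # (t, d_model): all keys so far
    V = torch.cat(v_cache, dim=0)    # (t, d_model): all values so far
    attn = torch.softmax(q @ K.T / d_model ** 0.5, dim=-1)
    return attn @ V                  # (1, d_model) attention output


for _ in range(5):                   # five decoding steps reuse the cache
    out = decode_step(torch.randn(1, d_model))
```

Because the cache holds one key and one value tensor per token and per layer, its memory footprint grows with sequence length, batch size, and layer count, which is exactly what layer-wise compression targets.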

Researchers have developed many methods to compress the KV cache and reduce memory consumption, but most of this work targets intra-layer compression, shrinking the cache within each individual Transformer layer. Layer-wise strategies, which compute the KV cache for only a subset of layers, remain largely unexplored, and the limited existing work in this direction typically requires additional training to maintain satisfactory performance. Methods such as H2O, SnapKV, and PyramidInfer all operate within a single Transformer layer and do not address layer-wise compression. A few works, including CLA, LCKV, and Ayer, do focus on layer-wise sharing, but all of them require further training of the model rather than being plug-and-play on well-trained LLMs.

A group of researchers from Shanghai Jiao Tong University, Central South University, Harbin Institute of Technology, and ByteDance proposed KVSharer, a plug-and-play method for compressing the KV cache of well-trained LLMs. The method rests on a counterintuitive observation: when the KV caches of two layers differ greatly, sharing one layer's cache with the other during inference does not significantly reduce performance. Leveraging this observation, KVSharer uses a search procedure to find an effective cache-sharing strategy across layers, significantly reducing GPU memory consumption while retaining most of the model's performance. Because it operates at the layer level, KVSharer is complementary to existing methods that compress the KV cache within each layer, providing an additional way to optimize memory in LLMs.

KVSharer works in two stages. First, for a given LLM, it searches for a sharing strategy: a list specifying which layers' KV caches should be replaced by those of other specific layers. Then, during prefill and generation on all subsequent tasks, the model follows that strategy, reusing the shared caches instead of computing and storing its own for the replaced layers.
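Concretely, such a strategy can be thought of as a mapping from a layer to the earlier layer whose cache it reuses. The sketch below shows how it might be applied when the caches are built; the layer pairs, tensor shapes, and the `compute_kv_for_layer` helper are hypothetical placeholders of ours, not KVSharer's actual implementation.

```python
import torch

# Hypothetical sharing strategy produced by the search step: each key is a
# layer whose own KV cache is skipped, each value is the earlier layer whose
# cache it reuses. The concrete pairs here are made up for illustration.
sharing_strategy = {20: 4, 25: 9, 30: 14}


def compute_kv_for_layer(layer_idx, hidden_states):
    """Stand-in for a layer's real key/value projections during prefill."""
    return torch.randn(1, 16, 64), torch.randn(1, 16, 64)


def build_kv_caches(num_layers, hidden_states):
    kv_caches = {}
    for layer in range(num_layers):
        if layer in sharing_strategy:
            # Reuse the donor layer's cache (already built, since donors are
            # earlier layers) instead of computing and storing a new one.
            kv_caches[layer] = kv_caches[sharing_strategy[layer]]
        else:
            kv_caches[layer] = compute_kv_for_layer(layer, hidden_states)
    return kv_caches


caches = build_kv_caches(num_layers=32, hidden_states=None)
```

Every layer listed in the strategy avoids storing its own keys and values, which is where the memory saving comes from.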

The search for an effective sharing strategy starts by measuring, on a test dataset, how different the KV caches of each pair of layers are, and prioritizes sharing the most dissimilar pairs. Caches are shared from one layer to another, with priority given to replacing the caches of layers closer to the output, which helps avoid degrading performance. A candidate shared pair is kept only if the model's output remains similar enough to that of the original model. The process continues until the target number of shared layers is reached, yielding a strategy that can then be reused on subsequent tasks to save memory and speed up inference.
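Below is a rough, greedy paraphrase of that search loop. The distance measure (Euclidean distance between flattened caches), the cosine-similarity check on outputs, and the 0.9 threshold are illustrative choices of ours rather than the paper's exact formulation, and `output_under_strategy` is a hypothetical callable standing in for running the model with a candidate strategy applied.

```python
import itertools
import torch


def flat(kv):
    """Flatten a layer's (key, value) pair into one vector for comparison."""
    return torch.cat([kv[0].reshape(-1), kv[1].reshape(-1)])


def search_sharing_strategy(layer_caches, output_under_strategy, ref_output,
                            target_shares, sim_threshold=0.9):
    """Greedy search loosely following the description above.

    layer_caches          : per-layer (K, V) tensors collected on a test set
    output_under_strategy : callable returning the model's output (a tensor)
                            when a candidate sharing strategy is applied
    ref_output            : output of the unmodified model on the same data
    """
    num_layers = len(layer_caches)
    # Rank all layer pairs from most to least dissimilar caches.
    pairs = sorted(
        itertools.combinations(range(num_layers), 2),
        key=lambda p: torch.dist(flat(layer_caches[p[0]]),
                                 flat(layer_caches[p[1]])).item(),
        reverse=True,
    )
    strategy = {}
    for a, b in pairs:
        if len(strategy) >= target_shares:
            break
        src, dst = min(a, b), max(a, b)   # replace the layer nearer the output
        if dst in strategy:
            continue
        candidate = {**strategy, dst: src}
        out = output_under_strategy(candidate)
        sim = torch.cosine_similarity(out.reshape(1, -1),
                                      ref_output.reshape(1, -1)).item()
        if sim >= sim_threshold:          # keep the pair only if output stays close
            strategy = candidate
    return strategy
```

The search is performed once per model; the resulting strategy is then fixed and applied to every later prompt, so the calibration cost is amortized over all subsequent inference.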

The researchers tested KVSharer on several English and bilingual models, including Llama2 and InternLM2, and found that it compresses the KV cache effectively with only small losses in performance. Using the OpenCompass benchmark, they evaluated reasoning, language, knowledge, and understanding tasks with datasets such as CMNLI, HellaSwag, and CommonSenseQA. At compression levels below 25%, KVSharer retained about 90-95% of the original model's performance and combined well with intra-layer techniques such as H2O and PyramidInfer, further improving memory efficiency and processing speed. Tests on larger models such as Llama2-70B confirmed that KVSharer compresses the cache effectively with minimal impact on performance.


In conclusion, KVSharer offers an efficient way to reduce memory consumption and improve inference speed in LLMs by taking the counterintuitive approach of sharing dissimilar KV caches across layers. The experiments show that KVSharer maintains over 90% of the original performance of mainstream LLMs while reducing KV cache computation by 30%, and it provides at least a 1.3x speedup in generation. It can also be combined with existing intra-layer KV cache compression methods for even greater memory savings and faster inference. The method therefore works alongside current compression techniques, applies to different tasks without extra training, and can serve as a basis for future work in this area.


Check out the Paper. All credit for this research goes to the researchers of this project.


    The post KVSharer: A Plug-and-Play Machine Learning Method that Shares the KV Cache between Layers to Achieve Layer-Wise Compression appeared first on MarkTechPost.
