
    KVSharer: A Plug-and-Play Machine Learning Method that Shares the KV Cache between Layers to Achieve Layer-Wise Compression

    November 2, 2024

Large language models (LLMs) built on the Transformer architecture have shown remarkable abilities across a wide range of tasks. These capabilities, however, usually come with a significant increase in model size, which translates into substantial GPU memory costs during inference. The KV cache is a standard technique in LLM inference: it stores the keys and values already computed in the attention mechanism so they can be reused at later decoding steps instead of being recomputed, making generation faster overall. The cache itself becomes a major memory consumer, with the stored keys and values accounting for over 80% of total memory usage, which wastes system resources and drives up the demand for computational capacity. Most existing KV cache compression methods focus on intra-layer compression within a single Transformer layer, while few works consider layer-wise compression.
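To make the role of the KV cache concrete, here is a minimal, self-contained sketch of single-head autoregressive decoding with a cache. The dimensions, weight names, and `decode_step` helper are illustrative placeholders of ours, not anything from the paper; the point is only that keys and values from earlier tokens are stored once and reused at every later step.

```python
import torch

# Toy single-head attention with a KV cache: keys/values of earlier tokens
# are stored once and reused, so each step only projects the newest token.
d_model = 64
W_q = torch.randn(d_model, d_model)
W_k = torch.randn(d_model, d_model)
W_v = torch.randn(d_model, d_model)

k_cache, v_cache = [], []  # grows by one entry per generated token


def decode_step(x_new):
    """x_new: (1, d_model) embedding of the newest token."""
    q = x_new @ W_q
    k_cache.append(x_new @ W_k)      # cache the new key ...
    v_cache.append(x_new @ W_v)      # ... and the new value
    K = torch.cat(k_cache, dim=0)    # (t, d_model): all keys so far
    V = torch.cat(v_cache, dim=0)    # (t, d_model): all values so far
    attn = torch.softmax(q @ K.T / d_model ** 0.5, dim=-1)
    return attn @ V                  # (1, d_model) attention output


for _ in range(5):                   # five decoding steps reuse the cache
    out = decode_step(torch.randn(1, d_model))
```

Because the cache holds one key and one value tensor per token and per layer, its memory footprint grows with sequence length, batch size, and layer count, which is exactly what layer-wise compression targets.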

Researchers have developed many methods to compress the KV cache and reduce memory consumption, but most of this work targets intra-layer compression, shrinking the cache within each individual Transformer layer. Layer-wise strategies, which compute the KV cache for only a subset of layers, remain largely unexplored, and the limited existing work in this direction typically requires additional training to maintain satisfactory performance. Methods such as H2O, SnapKV, and PyramidInfer all operate within a single Transformer layer and do not address layer-wise compression. A few works, including CLA, LCKV, and Ayer, do focus on layer-wise sharing, but all of them require further training of the model rather than being plug-and-play on well-trained LLMs.

A group of researchers from Shanghai Jiao Tong University, Central South University, Harbin Institute of Technology, and ByteDance proposed KVSharer, a plug-and-play method for compressing the KV cache of well-trained LLMs. The method rests on a counterintuitive observation: when the KV caches of two layers differ greatly, sharing one layer's cache with the other during inference does not significantly reduce performance. Leveraging this observation, KVSharer uses a search procedure to find an effective cache-sharing strategy across layers, significantly reducing GPU memory consumption while retaining most of the model's performance. Because it operates at the layer level, KVSharer is complementary to existing methods that compress the KV cache within each layer, providing an additional way to optimize memory in LLMs.

KVSharer works in two stages. First, for a given LLM, it searches for a sharing strategy: a list specifying which layers' KV caches should be replaced by those of other specific layers. Then, during prefill and generation on all subsequent tasks, the model follows that strategy, reusing the shared caches instead of computing and storing its own for the replaced layers.
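Concretely, such a strategy can be thought of as a mapping from a layer to the earlier layer whose cache it reuses. The sketch below shows how it might be applied when the caches are built; the layer pairs, tensor shapes, and the `compute_kv_for_layer` helper are hypothetical placeholders of ours, not KVSharer's actual implementation.

```python
import torch

# Hypothetical sharing strategy produced by the search step: each key is a
# layer whose own KV cache is skipped, each value is the earlier layer whose
# cache it reuses. The concrete pairs here are made up for illustration.
sharing_strategy = {20: 4, 25: 9, 30: 14}


def compute_kv_for_layer(layer_idx, hidden_states):
    """Stand-in for a layer's real key/value projections during prefill."""
    return torch.randn(1, 16, 64), torch.randn(1, 16, 64)


def build_kv_caches(num_layers, hidden_states):
    kv_caches = {}
    for layer in range(num_layers):
        if layer in sharing_strategy:
            # Reuse the donor layer's cache (already built, since donors are
            # earlier layers) instead of computing and storing a new one.
            kv_caches[layer] = kv_caches[sharing_strategy[layer]]
        else:
            kv_caches[layer] = compute_kv_for_layer(layer, hidden_states)
    return kv_caches


caches = build_kv_caches(num_layers=32, hidden_states=None)
```

Every layer listed in the strategy avoids storing its own keys and values, which is where the memory saving comes from.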

The search for an effective sharing strategy starts by measuring, on a test dataset, how different the KV caches of each pair of layers are, and prioritizes sharing the most dissimilar pairs. Caches are shared from one layer to another, with priority given to replacing the caches of layers closer to the output, which helps avoid degrading performance. A candidate shared pair is kept only if the model's output remains similar enough to that of the original model. The process continues until the target number of shared layers is reached, yielding a strategy that can then be reused on subsequent tasks to save memory and speed up inference.
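Below is a rough, greedy paraphrase of that search loop. The distance measure (Euclidean distance between flattened caches), the cosine-similarity check on outputs, and the 0.9 threshold are illustrative choices of ours rather than the paper's exact formulation, and `output_under_strategy` is a hypothetical callable standing in for running the model with a candidate strategy applied.

```python
import itertools
import torch


def flat(kv):
    """Flatten a layer's (key, value) pair into one vector for comparison."""
    return torch.cat([kv[0].reshape(-1), kv[1].reshape(-1)])


def search_sharing_strategy(layer_caches, output_under_strategy, ref_output,
                            target_shares, sim_threshold=0.9):
    """Greedy search loosely following the description above.

    layer_caches          : per-layer (K, V) tensors collected on a test set
    output_under_strategy : callable returning the model's output (a tensor)
                            when a candidate sharing strategy is applied
    ref_output            : output of the unmodified model on the same data
    """
    num_layers = len(layer_caches)
    # Rank all layer pairs from most to least dissimilar caches.
    pairs = sorted(
        itertools.combinations(range(num_layers), 2),
        key=lambda p: torch.dist(flat(layer_caches[p[0]]),
                                 flat(layer_caches[p[1]])).item(),
        reverse=True,
    )
    strategy = {}
    for a, b in pairs:
        if len(strategy) >= target_shares:
            break
        src, dst = min(a, b), max(a, b)   # replace the layer nearer the output
        if dst in strategy:
            continue
        candidate = {**strategy, dst: src}
        out = output_under_strategy(candidate)
        sim = torch.cosine_similarity(out.reshape(1, -1),
                                      ref_output.reshape(1, -1)).item()
        if sim >= sim_threshold:          # keep the pair only if output stays close
            strategy = candidate
    return strategy
```

The search is performed once per model; the resulting strategy is then fixed and applied to every later prompt, so the calibration cost is amortized over all subsequent inference.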

The researchers tested KVSharer on several English and bilingual models, including Llama2 and InternLM2, and found that it compresses the KV cache effectively with only small losses in performance. Using the OpenCompass benchmark, they evaluated reasoning, language, knowledge, and understanding tasks with datasets such as CMNLI, HellaSwag, and CommonSenseQA. At compression levels below 25%, KVSharer retained about 90-95% of the original model's performance and combined well with intra-layer techniques such as H2O and PyramidInfer, further improving memory efficiency and processing speed. Tests on larger models such as Llama2-70B confirmed that KVSharer compresses the cache effectively with minimal impact on performance.


In conclusion, KVSharer offers an efficient way to reduce memory consumption and improve inference speed in LLMs by taking the counterintuitive approach of sharing dissimilar KV caches across layers. The experiments show that KVSharer maintains over 90% of the original performance of mainstream LLMs while reducing KV cache computation by 30%, and it provides at least a 1.3x speedup in generation. It can also be combined with existing intra-layer KV cache compression methods for even greater memory savings and faster inference. The method therefore works alongside current compression techniques, applies to different tasks without extra training, and can serve as a basis for future work in this area.


Check out the Paper. All credit for this research goes to the researchers of this project.


    The post KVSharer: A Plug-and-Play Machine Learning Method that Shares the KV Cache between Layers to Achieve Layer-Wise Compression appeared first on MarkTechPost.
