In recent times, large language models (LLMs) built on the Transformer architecture have shown remarkable abilities across a wide range of tasks. However, these impressive capabilities usually come with a significant increase in model size, resulting in substantial GPU memory costs during inference. The KV cache is a widely used technique in LLM inference: it stores the keys and values already computed in the attention layers so they can be reused in later decoding steps instead of being recomputed, speeding up generation. For long inputs and outputs, however, the stored keys and values can account for over 80% of total memory usage during inference, straining system resources and driving up the demand for compute. Most existing KV cache compression methods focus on intra-layer compression within a single Transformer layer, while layer-wise compression has received far less attention.
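To make the KV cache idea concrete, here is a minimal sketch of cached autoregressive decoding using the Hugging Face Transformers API; the model choice and the greedy decoding loop are illustrative, not details from the paper.

```python
# Minimal sketch of KV caching during autoregressive decoding.
# The model ("gpt2") and the greedy loop are illustrative choices only.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

input_ids = tokenizer("KV caching speeds up decoding because", return_tensors="pt")["input_ids"]

with torch.no_grad():
    # Prefill: compute keys/values for the whole prompt once and keep them.
    out = model(input_ids, use_cache=True)
    past_key_values = out.past_key_values  # one (key, value) entry per layer

    next_token = out.logits[:, -1:].argmax(dim=-1)
    generated = [next_token]

    for _ in range(20):
        # Decode step: feed only the newest token; the cached keys/values
        # for all earlier positions are reused instead of recomputed.
        out = model(next_token, past_key_values=past_key_values, use_cache=True)
        past_key_values = out.past_key_values
        next_token = out.logits[:, -1:].argmax(dim=-1)
        generated.append(next_token)

print(tokenizer.decode(torch.cat(generated, dim=-1)[0]))
```

The memory cost grows with sequence length and layer count because every layer keeps its own key and value tensors, which is exactly the cost that KV cache compression methods target.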
Researchers have developed many methods to compress the KV cache and reduce memory consumption, but most of this work targets compression within each Transformer layer. Approaches such as H2O, SnapKV, and PyramidInfer are all intra-layer techniques. Layer-wise strategies, which compute the KV cache for only a subset of layers, remain largely unexplored. The few existing layer-wise methods, such as CLA, LCKV, and Ayer, require additional training of the model to maintain satisfactory performance rather than being plug-and-play on well-trained LLMs.
A group of researchers from Shanghai Jiao Tong University, Central South University, Harbin Institute of Technology, and ByteDance proposed KVSharer, a plug-and-play method for compressing the KV cache of well-trained LLMs. The researchers observed a counterintuitive phenomenon: when the KV caches of two layers differ greatly, sharing one layer's KV cache with the other during inference does not significantly hurt performance. Leveraging this observation, KVSharer runs a search procedure to find an effective layer-wise KV cache-sharing strategy before inference. KVSharer significantly reduces GPU memory consumption while preserving most of the model's performance. Because it compresses across layers, it is complementary to existing intra-layer compression methods, providing an additional way to optimize memory in LLMs.
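For illustration, the sketch below shows what sharing a KV cache between layers means at the cache level, assuming a tuple-style `past_key_values` and a hypothetical `sharing_map`; the actual KVSharer implementation goes further and avoids computing the replaced layers' caches in the first place.

```python
# Illustrative sketch only: overwrite the KV cache of each "target" layer with
# that of a "source" layer, so the target layer's own cache is never kept.
# `sharing_map` is hypothetical; KVSharer searches for this mapping automatically.
def share_kv_cache(past_key_values, sharing_map):
    # past_key_values: tuple of (key, value) tensors, one pair per layer
    cache = list(past_key_values)
    for target_layer, source_layer in sharing_map.items():
        cache[target_layer] = cache[source_layer]
    return tuple(cache)

# Example: layers 20 and 24 reuse the KV caches of layers 3 and 7, respectively.
# shared_cache = share_kv_cache(past_key_values, {20: 3, 24: 7})
```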
KVSharer works in two stages. First, given an LLM, it searches for a sharing strategy: a list specifying which layers' KV caches should be replaced by those of other specific layers. Then, during prefill and generation for all subsequent tasks, the model reuses KV caches according to that strategy, so the replaced layers do not need to store their own caches.
The sharing strategy is found as follows. KVSharer first measures the differences between the KV caches of every pair of layers on a calibration dataset and prioritizes the most dissimilar pairs for sharing. Within each pair, the cache is shared from one layer to the other, with priority given to replacing layers nearer the output to limit performance degradation. A candidate sharing pair is kept only if the model's output on the calibration data remains similar enough to the original. The process continues until the target number of shared layers is reached, yielding a strategy that speeds up all future tasks by reusing KV caches efficiently, as sketched below.
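A rough sketch of such a greedy search is shown here, purely to illustrate the procedure described above. The per-layer KV representations, the output-evaluation callable, the Euclidean distance, the cosine-similarity check, and the threshold are assumed stand-ins rather than details taken from the paper.

```python
# Illustrative greedy search for a layer-wise KV-sharing strategy (not the official code).
import itertools
import torch
import torch.nn.functional as F

def search_sharing_strategy(layer_kv, eval_output, num_shared, sim_threshold=0.9):
    """layer_kv: list of flattened per-layer KV-cache tensors from calibration data.
    eval_output(strategy) -> flattened model output under a {target: source} sharing map.
    Both inputs are hypothetical hooks the caller must provide."""
    ref_out = eval_output({})  # reference output with no sharing

    # Rank all layer pairs by Euclidean distance, most dissimilar first.
    pairs = sorted(
        itertools.combinations(range(len(layer_kv)), 2),
        key=lambda p: torch.dist(layer_kv[p[0]], layer_kv[p[1]]).item(),
        reverse=True,
    )

    strategy = {}
    for i, j in pairs:
        if len(strategy) >= num_shared:
            break
        # Prefer replacing the layer nearer the output with the earlier layer's cache.
        target, source = max(i, j), min(i, j)
        if target in strategy or target in strategy.values():
            continue
        candidate = {**strategy, target: source}
        out = eval_output(candidate)
        # Keep the candidate only if the output stays close enough to the original.
        if F.cosine_similarity(out.flatten(), ref_out.flatten(), dim=0).item() >= sim_threshold:
            strategy = candidate
    return strategy
```

The sketch only conveys the structure of the search: rank layer pairs by dissimilarity, greedily add sharing pairs, and accept each one only when the model's output remains close to the original.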
The researchers tested KVSharer on several English and bilingual models, including Llama2 and InternLM2, and found that it compresses the KV cache effectively with only small losses in performance. Using the OpenCompass benchmark, they evaluated reasoning, language, knowledge, and understanding tasks with datasets such as CMNLI, HellaSwag, and CommonSenseQA. At compression levels below 25%, KVSharer retained about 90-95% of the original models' performance and combined well with intra-layer techniques such as H2O and PyramidInfer, further improving memory efficiency and processing speed. Tests on larger models, such as Llama2-70B, confirmed that KVSharer compresses the cache effectively with minimal impact on performance.
In conclusion, KVSharer offers an efficient solution for reducing memory consumption and improving inference speed in LLMs by leveraging the counterintuitive strategy of sharing dissimilar KV caches. Experiments show that it maintains over 90% of the original performance of mainstream LLMs while reducing KV cache computation by 30%, and it provides at least a 1.3x speedup in generation. KVSharer can also be integrated with existing intra-layer KV cache compression methods to achieve even greater memory savings and faster inference. Hence, the method works well alongside current compression techniques, applies to different tasks without extra training, and can serve as a basis for future work on layer-wise compression.
Check out the Paper. All credit for this research goes to the researchers of this project.