Efficient deployment of large language models (LLMs) requires high throughput and low latency. However, LLMs' substantial memory consumption, particularly by the key-value (KV) cache, makes large batch sizes, and therefore high throughput, difficult to achieve. The KV cache, which stores the keys and values of previously processed tokens during generation, can consume over 30% of GPU memory. Various approaches, such as compressing KV sequences and dynamic cache eviction policies, aim to alleviate this memory burden.
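For a sense of scale, here is a back-of-envelope estimate of KV cache size, assuming a Llama-2-7B-like configuration (32 layers, 32 KV heads, head dimension 128) in fp16; the exact share of GPU memory depends on batch size, sequence length, and hardware, so the numbers below are illustrative only:

```python
# Rough KV cache size for a standard decoder that caches keys and values at every layer.
def kv_cache_bytes(batch, seq_len, n_layers=32, n_kv_heads=32, head_dim=128, dtype_bytes=2):
    # 2x for keys and values, stored at every layer for every token
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * dtype_bytes

gib = kv_cache_bytes(batch=32, seq_len=4096) / 2**30
print(f"{gib:.1f} GiB")  # ~64 GiB for 32 sequences of 4k tokens, several times the ~13 GiB of fp16 7B weights
```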
While most works concentrate on compressing KV sequences, vLLM takes a different route and introduces PagedAttention to mitigate memory fragmentation. Sequence-compression approaches include compressing prompts, removing redundancy from the input context, and incrementally compressing spans of tokens. Other methods prune unimportant tokens, apply different pruning strategies to different attention heads, or store only the most crucial tokens.
Researchers from the School of Information Science and Technology, ShanghaiTech University, and the Shanghai Engineering Research Center of Intelligent Vision and Imaging present an efficient approach to reducing the KV cache memory of transformer decoders by decreasing the number of layers whose keys and values are cached. By pairing the queries of all layers with the keys and values of only the top layer, just one layer's keys and values need to be cached, saving substantial memory without additional computation overhead. The design is inspired by the view that transformers iteratively refine token representations layer by layer, so attending only to the top layer resembles the cross-attention of standard encoder-decoder transformers. To mitigate performance degradation, the model retains standard attention in a few layers.
The proposed method pairs the queries of all layers with the keys and values of only the top layer, so KVs for the other layers never need to be computed or cached, saving both memory and computation. It also reduces model parameters, since the weights that map hidden representations to keys and values can be dropped for those layers. A cyclic dependency arises because a token's top-layer KVs would be needed to compute that same token's own representations; the model resolves this by masking the diagonal of the attention matrix so that no token attends to itself, and by using zero vectors as dummy KVs for the first token. Retaining standard attention in a few layers, termed warmup layers, preserves the syntactic-to-semantic processing pattern observed in transformers and keeps performance competitive with standard models.
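The following is a minimal sketch of this single-cached-layer decoding idea, not the authors' implementation: the dimensions, weight initializations, and the omission of warmup layers, MLPs, and normalization are simplifying assumptions for illustration.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
d_model, n_layers, n_heads = 256, 4, 4
head_dim = d_model // n_heads

# Per-layer query and output projections; K/V projections exist only for the top layer,
# which is why the other layers' KV weights (and caches) can be dropped entirely.
W_q = [torch.randn(d_model, d_model) * 0.02 for _ in range(n_layers)]
W_o = [torch.randn(d_model, d_model) * 0.02 for _ in range(n_layers)]
W_k_top = torch.randn(d_model, d_model) * 0.02
W_v_top = torch.randn(d_model, d_model) * 0.02

def split_heads(x):   # (t, d_model) -> (n_heads, t, head_dim)
    return x.view(x.shape[0], n_heads, head_dim).transpose(0, 1)

def merge_heads(x):   # (n_heads, t, head_dim) -> (t, d_model)
    return x.transpose(0, 1).reshape(x.shape[1], d_model)

@torch.no_grad()
def decode_step(h, k_cache, v_cache):
    """Process one new token given the cached top-layer KVs of all previous tokens.

    h:        (1, d_model) embedding of the new token.
    k_cache:  (t, d_model) top-layer keys of previous tokens; a single zero row
              serves as the dummy KV for the very first token.
    v_cache:  (t, d_model) top-layer values of previous tokens.
    """
    for layer in range(n_layers):
        q = split_heads(h @ W_q[layer])   # queries come from every layer...
        k = split_heads(k_cache)          # ...but keys and values only from the cached top layer
        v = split_heads(v_cache)
        # The new token never attends to itself: its own top-layer KV is not in the cache yet,
        # which mirrors the masked diagonal described above.
        attn = F.scaled_dot_product_attention(q, k, v)
        h = h + merge_heads(attn) @ W_o[layer]   # residual connection (MLP omitted)

    # Only the final (top) layer produces KVs, and only these are appended to the cache.
    k_cache = torch.cat([k_cache, h @ W_k_top], dim=0)
    v_cache = torch.cat([v_cache, h @ W_v_top], dim=0)
    return h, k_cache, v_cache

# Generate a few dummy tokens: the cache grows by one (key, value) row per token in total,
# instead of one row per token per layer as in a standard decoder.
k_cache = torch.zeros(1, d_model)   # zero dummy KV for the first token
v_cache = torch.zeros(1, d_model)
for _ in range(5):
    token_embedding = torch.randn(1, d_model)
    _, k_cache, v_cache = decode_step(token_embedding, k_cache, v_cache)
print(k_cache.shape)   # torch.Size([6, 256]) -- a single cached layer
```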
The researchers evaluated their method on models with 1.1B, 7B, and 30B parameters across different GPUs, including the NVIDIA GeForce RTX 3090 and the A100. The implementation builds on HuggingFace Transformers with FlashAttention 2, fused RMSNorm, fused cross-entropy, and fused SwiGLU kernels. Evaluation focuses on latency and throughput, with results showing significantly larger feasible batch sizes and higher throughput than standard Llama models across various settings, while zero-shot accuracy on commonsense reasoning tasks remains comparable to TinyLlama. Integration with StreamingLLM yields lower latency and memory consumption and allows sequences of effectively unbounded length to be processed. Overall, the method achieves competitive performance with higher inference efficiency, although pre-training takes longer because of the iterative training process.
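A rough sketch of how throughput and peak memory can be measured with HuggingFace Transformers is shown below; the checkpoint name, batch size, and generation length are placeholders rather than the authors' benchmark configuration.

```python
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder checkpoint and settings for illustration only.
model_name = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
tokenizer = AutoTokenizer.from_pretrained(model_name)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "left"   # left padding for decoder-only generation
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.float16, device_map="cuda"
)

batch_size, gen_len = 8, 128
prompts = ["The quick brown fox"] * batch_size
inputs = tokenizer(prompts, return_tensors="pt", padding=True).to("cuda")

torch.cuda.synchronize()
start = time.perf_counter()
outputs = model.generate(**inputs, max_new_tokens=gen_len, do_sample=False)
torch.cuda.synchronize()
elapsed = time.perf_counter() - start

new_tokens = (outputs.shape[1] - inputs["input_ids"].shape[1]) * batch_size
print(f"throughput: {new_tokens / elapsed:.1f} tokens/s, "
      f"peak memory: {torch.cuda.max_memory_allocated() / 2**30:.2f} GiB")
```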
This study presents a robust method to reduce memory consumption and enhance throughput in LLMs by minimizing the number of layers that require key and value computation and caching. Empirical results demonstrate substantial memory reduction and throughput improvement with minimal performance loss. The method also integrates seamlessly with other memory-saving techniques such as StreamingLLM.
Check out the Paper and GitHub. All credit for this research goes to the researchers of this project.