    MIT Researchers Propose Cross-Layer Attention (CLA): A Modification to the Transformer Architecture that Reduces the Size of the Key-Value KV Cache by Sharing KV Activations Across Layers

    May 25, 2024

    The memory footprint of the key-value (KV) cache can be a bottleneck when serving large language models (LLMs), as it scales proportionally with both sequence length and batch size. This overhead limits batch sizes for long sequences and necessitates costly techniques like offloading when on-device memory is scarce. Furthermore, the ability to persistently store and retrieve KV caches over extended periods is desirable to avoid redundant computations. However, the size of the KV cache directly impacts the cost and feasibility of storing and retrieving these persistent caches. As LLM applications increasingly demand longer input sequences, the memory requirements of the KV cache have become a critical consideration in designing efficient transformer-based language models.
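To make the scaling concrete: the KV cache holds one key tensor and one value tensor per layer per key/value head, so its size grows linearly with both sequence length and batch size. A minimal sketch of the arithmetic (the model dimensions below are illustrative, not taken from the paper):

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim,
                   seq_len, batch_size, bytes_per_elem=2):
    """Total KV cache size: 2 tensors (K and V) per layer per KV head,
    each of shape (batch_size, seq_len, head_dim)."""
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_elem)

# Illustrative 7B-class config: 32 layers, 32 KV heads, head_dim 128, fp16.
size = kv_cache_bytes(32, 32, 128, seq_len=4096, batch_size=8)
print(f"{size / 2**30:.0f} GiB")  # 16 GiB; doubling either seq_len or
                                  # batch_size doubles the footprint
```

At these (hypothetical) dimensions the cache alone consumes 16 GiB at batch size 8, which is why long sequences force small batches or offloading.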

    Traditionally, Multi-Query Attention (MQA) and Grouped-Query Attention (GQA) have been employed to reduce the KV cache size. The original transformer architecture used Multi-Head Attention (MHA), where each query head attends to the keys and values produced by its own distinct key/value head. To reduce the overhead of storing and accessing the KV cache during decoding, MQA instead shares a single key/value head across all query heads. GQA generalizes this idea by organizing the query heads into groups, with each group sharing one key/value head and the number of groups allowed to vary. Since the KV cache size scales only with the number of distinct key/value heads, MQA and GQA effectively reduce the storage overhead. However, these techniques are limited in how much memory reduction they can achieve.
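The head-sharing pattern of MHA, MQA, and GQA can be captured by a single mapping from query head to key/value head. A sketch (head counts here are illustrative):

```python
def kv_head_for_query(query_head, num_query_heads, num_kv_heads):
    """Which key/value head a given query head attends to.
    MHA: num_kv_heads == num_query_heads (one KV head per query head).
    MQA: num_kv_heads == 1 (all query heads share one KV head).
    GQA: anything in between; query heads split into equal-sized groups."""
    assert num_query_heads % num_kv_heads == 0
    group_size = num_query_heads // num_kv_heads
    return query_head // group_size

# GQA with 8 query heads and 2 KV heads:
print([kv_head_for_query(h, 8, 2) for h in range(8)])
# [0, 0, 0, 0, 1, 1, 1, 1]
```

Since cache storage scales with the number of distinct KV heads, the GQA example above needs only a quarter of the MHA cache.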

    In this paper, researchers from MIT have developed a method called Cross-Layer Attention (CLA) that extends the idea of key/value head sharing. A diagrammatic view of it is presented in Figure 1. CLA enables the sharing of key and value heads not only within a layer but also across adjacent layers. By computing key/value projections for only a subset of layers and allowing other layers to reuse KV activations from previous layers, CLA achieves a significant reduction in the KV cache memory footprint. The reduction factor is equal to the sharing factor or slightly less if the sharing factor does not evenly divide the number of layers.
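In implementation terms, CLA amounts to letting most layers read K/V from an earlier layer's cache instead of projecting their own. A sketch of the layer-to-cache mapping under the simple grouping described above (the grouping of adjacent layers is an assumption drawn from the CLA2/CLA3 description):

```python
def kv_source_layer(layer, sharing_factor):
    """Under CLA, only the first layer of each group of `sharing_factor`
    adjacent layers computes fresh K/V projections; the remaining layers
    in the group reuse those cached KV activations."""
    return (layer // sharing_factor) * sharing_factor

# CLA2 over 8 layers: even layers compute KV, odd layers reuse the layer below.
print([kv_source_layer(l, 2) for l in range(8)])
# [0, 0, 2, 2, 4, 4, 6, 6]
```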

    CLA is orthogonal to MQA and GQA, meaning it can be combined with either technique. Different CLA configurations are governed by the sharing factor: the number of adjacent layers that share the output of each KV projection. For example, as shown in Figure 2, CLA2 shares each KV projection between a pair of adjacent layers, while CLA3 shares it among a group of three layers.
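The cache reduction follows directly from counting how many layers still materialize their own KV cache, which also shows why the reduction falls slightly short of the sharing factor when it does not evenly divide the layer count (layer counts below are illustrative):

```python
import math

def distinct_kv_caches(num_layers, sharing_factor):
    """Number of layers that materialize their own KV cache under CLA."""
    return math.ceil(num_layers / sharing_factor)

for num_layers, s in [(30, 2), (30, 3), (32, 3)]:
    kept = distinct_kv_caches(num_layers, s)
    print(f"{num_layers} layers, CLA{s}: {kept} KV caches "
          f"({num_layers / kept:.2f}x reduction)")
# 30 layers, CLA2: 15 KV caches (2.00x reduction)
# 30 layers, CLA3: 10 KV caches (3.00x reduction)
# 32 layers, CLA3: 11 KV caches (2.91x reduction)  <- 3 does not divide 32
```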

    CLA offers several practical benefits:
    • It reduces the memory footprint of the intermediate KV activation tensors materialized during training, although this reduction is typically small compared to the model’s hidden states and MLP activations.
    • It is fully compatible with standard tensor parallelism techniques for sharding model weights across multiple accelerators. Under pipeline parallelism, either all layers sharing a KV cache must be kept in the same pipeline stage, or KV activations must be communicated between stages.
    • By reducing the total number of key/value projection blocks, CLA slightly decreases the model’s parameter count and the FLOPs required during forward and backward passes.
    • Most importantly, CLA enables larger batch sizes and longer KV cache persistence times, which can improve inference latency in the context of a full LLM serving stack.
    One caveat: unlike MQA and GQA, CLA has no direct effect on the memory bandwidth consumed by the attention mechanism in each decoding step, nor on the latency of the core attention computation during decoding.

    To assess CLA’s efficacy, the researchers trained transformer-based language models from scratch at the 1 billion and 3 billion parameter scales. Their experiments aimed to answer questions like what accuracy/memory tradeoffs are possible using CLA, how it compares to plain GQA or MQA, how it interacts with these techniques, what CLA configurations perform best given a fixed memory budget, and whether the effects are consistent across scales.

    The key findings of the experiments are as follows:
    • CLA enables favorable accuracy/memory tradeoffs compared to plain GQA or MQA.
    • A sharing factor of 2 (CLA2) was more effective than other sharing factors in the experimental regime.
    • CLA was consistently effective at decreasing KV cache storage when combined with MQA.
    • CLA models benefited from training with higher learning rates than comparable non-CLA models.
    • The benefits were consistent across both the 1B- and 3B-parameter scales.

    Quantitatively, MQA-CLA2 consistently achieved the lowest validation perplexity (or came within 0.01 points of it) for a given KV cache memory budget and model size. At both the 1B and 3B scales, for MQA models with typical head sizes of 64 and 128, applying CLA2 yielded a 2× KV cache reduction while incurring, at worst, a very modest perplexity degradation (less than a 1% change), and in some cases even improving perplexity. The researchers recommend the MQA-CLA2 recipe to practitioners as a conservative change to existing MQA architectures that delivers substantial memory overhead reductions with relatively little risk.
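As a rough back-of-the-envelope check of how the savings stack, consider the same illustrative dimensions as above (not the paper's models): MQA already collapses the KV heads to one, and CLA2 then halves the number of caching layers.

```python
def kv_cache_gib(kv_layers, kv_heads, head_dim, seq_len, batch, bytes_per_elem=2):
    """KV cache size in GiB: K and V tensors for each caching layer and head."""
    return (2 * kv_layers * kv_heads * head_dim * seq_len
            * batch * bytes_per_elem) / 2**30

layers, q_heads, head_dim, seq, batch = 32, 32, 128, 4096, 8
mha      = kv_cache_gib(layers, q_heads, head_dim, seq, batch)  # one KV head per query head
mqa      = kv_cache_gib(layers, 1, head_dim, seq, batch)        # single shared KV head
mqa_cla2 = kv_cache_gib(layers // 2, 1, head_dim, seq, batch)   # CLA2: half the layers cache
print(mha, mqa, mqa_cla2)  # 16.0 0.5 0.25 -> CLA2 halves the MQA cache again
```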

    The researchers suspect that the LLMs that will gain the most from CLA are those with extremely long sequences, such as models with long-term memory or those using Landmark Attention, which renders attention over long contexts more feasible. However, they leave end-to-end inference efficiency evaluations of large, long-context models employing CLA as an interesting problem for future work.

    In conclusion, Cross-Layer Attention (CLA) emerges as an effective method for reducing the KV cache memory storage footprint of transformer models by a factor of 2× with roughly equal perplexity compared to existing techniques. Based on extensive experimental evaluation against well-tuned baselines at both the 1B- and 3B-parameter scales, CLA advances the Pareto frontier for memory-efficient transformers, making it a promising solution for memory-constrained applications of large language models.

    Check out the Paper. All credit for this research goes to the researchers of this project.
    The post MIT Researchers Propose Cross-Layer Attention (CLA): A Modification to the Transformer Architecture that Reduces the Size of the Key-Value KV Cache by Sharing KV Activations Across Layers appeared first on MarkTechPost.
