
    TransMLA: Transforming GQA-based Models Into MLA-based Models

    February 16, 2025

Large Language Models (LLMs) have become important productivity tools, with open-source models increasingly matching the performance of their closed-source counterparts. These models operate through next-token prediction: tokens are generated sequentially, with attention computed between each token and its predecessors. Key-value (KV) pairs are cached to avoid redundant computation, but the memory required for this cache grows with context length and poses substantial limitations. The problem is evident in models like LLaMA-65B, which requires over 86GB of GPU memory to store 512K tokens even with 8-bit key-value quantization, exceeding the capacity of even a high-end H100-80GB GPU.
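As a rough illustration of where a number like that comes from, here is a minimal back-of-the-envelope sketch of KV-cache sizing. The layer count and KV dimension below are assumptions chosen to land near the ~86GB figure quoted above, not official model specifications:

```python
# Minimal sketch: KV-cache size = 2 (keys + values) x layers x KV dim
# x sequence length x bytes per value. Config values here are assumptions.
def kv_cache_bytes(n_layers: int, kv_dim: int, seq_len: int,
                   bytes_per_value: int = 1) -> int:
    return 2 * n_layers * kv_dim * seq_len * bytes_per_value

# Assumed config: 80 layers, 1024-dim KV (e.g. 8 KV heads x 128 dims),
# 512K tokens, 8-bit (1-byte) quantized values.
size = kv_cache_bytes(n_layers=80, kv_dim=1024, seq_len=512 * 1024)
print(f"{size / 1e9:.1f} GB")  # -> 85.9 GB, in line with the ~86GB above
```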

Existing approaches to the KV cache memory footprint in LLMs each come with trade-offs. Linear attention methods such as Linear Transformer, RWKV, and Mamba scale linearly with sequence length. Dynamic token pruning approaches such as LazyLLM, A2SF, and SnapKV discard less important tokens, while head-dimension reduction techniques such as SliceGPT and Sheared shrink or remove attention heads. Methods that share KV representations across layers, including YONO and MiniCache, and quantization techniques such as GPTQ and KVQuant also reduce memory usage. However, all of these approaches trade computational efficiency against model performance, often sacrificing essential information or attention patterns.

Researchers from Peking University and Xiaomi Corp., Beijing, have proposed TransMLA, a post-training method that converts widely used GQA-based pre-trained models into MLA-based models. Their research provides theoretical proof that Multi-head Latent Attention (MLA) delivers greater expressive power than Grouped-Query Attention (GQA) while maintaining the same KV cache overhead. The team has successfully converted several prominent GQA-based models, including LLaMA-3, Qwen-2.5, Mistral, Mixtral, Gemma-2, and Phi-4, into equivalent MLA models. The transformation offers a resource-efficient migration path for mainstream LLM attention design, improving model performance while reducing computational cost and environmental impact.

The transformation from GQA to MLA is illustrated on the Qwen2.5 architecture. In the original Qwen2.5-7B model, each layer contains 28 query heads and 4 key/value heads, with a head dimension of 128 and a KV cache dimension of 1024. The conversion to MLA adjusts the output dimension of the key and value projection matrices to 512 each, keeping the total KV cache dimension at 1024. The key innovation of TransMLA is to project these 512-dimensional representations back up to 3584 dimensions, enabling all 28 query heads to interact with distinct keys and values rather than shared copies. This substantially enhances the model's expressive power while keeping the KV cache size constant and adding only a modest 12.5% more parameters to each of the Q-K and V-O pairs.
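The core trick can be sketched in a few lines of PyTorch: GQA's replication of shared KV heads is a fixed linear map, so the replicated key projection has rank at most 512 and factors exactly through a 512-dimensional cached latent. The sketch below is an assumption-laden simplification of the paper's construction (illustrative variable names; RoPE handling and the value path omitted), not the authors' code:

```python
import torch

torch.manual_seed(0)
d_model, n_q_heads, n_kv_heads, d_head = 3584, 28, 4, 128
d_kv  = n_kv_heads * d_head        # 512: per-token cache size for keys
group = n_q_heads // n_kv_heads    # 7 query heads share each KV head

# Stand-in for the pretrained GQA key projection weight (not real weights).
W_k = torch.randn(d_kv, d_model) / d_model**0.5

# GQA replicates each of the 4 key heads 7x so all 28 query heads get a key;
# that replication is a fixed linear map R of shape (3584, 512).
R = torch.zeros(d_model, d_kv)
for h in range(n_kv_heads):
    for g in range(group):
        q = h * group + g
        R[q * d_head:(q + 1) * d_head, h * d_head:(h + 1) * d_head] = torch.eye(d_head)

# R @ W_k is (3584, 3584) but has rank <= 512, so it factors exactly through
# a 512-dim latent via SVD (the "orthogonal decomposition" in the text).
U, S, Vh = torch.linalg.svd(R @ W_k, full_matrices=False)
W_down = torch.diag(S[:d_kv]) @ Vh[:d_kv]   # (512, 3584): makes the cached latent
W_up   = U[:, :d_kv]                        # (3584, 512): new trainable up-projection

x = torch.randn(d_model)                    # a token's hidden state
latent = W_down @ x                         # what gets cached: still 512 dims
assert torch.allclose(W_up @ latent, R @ (W_k @ x), atol=1e-4)  # matches GQA

# W_up (3584 x 512) accounts for the 12.5% parameter increase quoted above:
# 512 / (3584 + 512) = 12.5% relative to the original Q-K projection pair.
# After fine-tuning W_up, each of the 28 query heads can see a distinct key.
```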

Performance evaluation shows significant improvements of the TransMLA model over the original GQA-based architecture. Fine-tuned on the SmolTalk instruction dataset, the TransMLA model achieves lower training loss, indicating better data-fitting capability, with the largest gains on math and code tasks across both the 7B and 14B configurations. The researchers also investigated the source of these improvements through controlled experiments: when the KV dimension was expanded with a simple identity-map initialization, without orthogonal decomposition, the improvement on GSM8K was minimal (0.15%), confirming that the substantial gains come from the combination of enlarged KV dimensions and orthogonal decomposition.
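Continuing the sketch above, the ablation's two initializations might look as follows. This is an assumed reading of the description, not the paper's code: the identity-map baseline makes the expanded model numerically equivalent to GQA at initialization, while TransMLA's orthogonal decomposition gives the up-projection orthonormal columns:

```python
# Identity-map baseline (assumed): initialize the factors so the expanded
# model reproduces GQA exactly, with no SVD involved.
W_up_id, W_down_id = R.clone(), W_k.clone()
assert torch.allclose(W_up_id @ W_down_id, R @ W_k)

# TransMLA's orthogonal decomposition: W_up's columns are orthonormal left
# singular vectors, so the 512 cached dimensions are decorrelated. Per the
# text, this structure plus the enlarged KV dimension drives the gains; the
# identity baseline alone improved GSM8K by only 0.15%.
assert torch.allclose(W_up.T @ W_up, torch.eye(d_kv), atol=1e-4)
```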

In conclusion, the researchers present a significant advancement in LLM architecture with TransMLA, an approach that converts widely used GQA-based pre-trained models into MLA-based models. The theoretical proofs and empirical validation establish that the transformation succeeds and improves performance, bridging a critical gap between the GQA and MLA architectures through comprehensive theoretical and experimental comparison. Future work can extend the transformation to major large-scale models such as LLaMA, Qwen, and Mistral, with further optimization through DeepSeek R1 distillation techniques.


Check out the Paper and GitHub Page. All credit for this research goes to the researchers of this project.

The post TransMLA: Transforming GQA-based Models Into MLA-based Models appeared first on MarkTechPost.
