
    FastGen: Cutting GPU Memory Costs Without Compromising on LLM Quality

    May 13, 2024

    Autoregressive language models (ALMs) have proven their capability in machine translation, text generation, and related tasks. However, these models pose challenges, including computational complexity and high GPU memory usage, so despite their success across many applications there is an urgent need for a cost-effective way to serve them. Generative inference with large language models (LLMs) relies on the KV cache mechanism to speed up generation, but as model size and generation length grow, the memory consumed by the KV cache grows with them. When that memory exceeds GPU capacity, generative inference has to resort to offloading.
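
    To make the memory pressure concrete, the short sketch below estimates the KV cache footprint for a hypothetical decoder-only model with roughly 7B-class dimensions; the layer, head, context, and batch numbers are illustrative assumptions, not figures from the paper.

        # Back-of-the-envelope KV cache size for a decoder-only transformer.
        # All dimensions below are hypothetical, chosen to resemble a 7B-class model.
        def kv_cache_bytes(n_layers, n_heads, head_dim, seq_len, batch_size, bytes_per_value=2):
            # Every layer stores one key and one value vector per head per token.
            return 2 * n_layers * n_heads * head_dim * seq_len * batch_size * bytes_per_value

        gb = kv_cache_bytes(n_layers=32, n_heads=32, head_dim=128,
                            seq_len=4096, batch_size=8) / 1e9
        print(f"KV cache: ~{gb:.1f} GB")  # ~17 GB at fp16, on top of weights and activations

    At that scale the cache alone can rival the model weights, which is why long generations push inference toward offloading.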

    Many works have been carried out to enhance model efficiency for LLMs; one such method skips multiple tokens at a particular time stamp. More recently, a technique that adds a token selection task to the original BERT model learns to select performance-crucial tokens and to detect and prune unimportant ones using a designed learnable threshold. However, these approaches apply only to non-autoregressive models and require an extra re-training phase, making them less suitable for autoregressive LLMs like ChatGPT and Llama. Pruning tokens within the KV cache of autoregressive LLMs is therefore a natural way to fill this gap.

    Researchers from the University of Illinois Urbana-Champaign and Microsoft proposed FastGen, a highly effective technique that enhances the inference efficiency of LLMs without any visible loss in quality, using lightweight model profiling and adaptive key-value caching. FastGen constructs the KV cache adaptively, evicting long-range contexts from attention heads that do not need them. The construction is guided by lightweight attention profiling, so it requires no resource-intensive fine-tuning or re-training. As a result, FastGen reduces GPU memory usage with negligible loss in generation quality.
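
    The per-head selection behind that adaptive construction can be pictured roughly as follows. This is a simplified sketch in the spirit of FastGen's attention profiling, not the authors' implementation: the candidate policy set, the helper names, and the 0.95 recovery target are illustrative assumptions.

        import numpy as np

        def attention_mass_kept(attn_profile, keep_mask):
            # Fraction of this head's total attention weight that lands on kept positions.
            return float(attn_profile[keep_mask].sum() / attn_profile.sum())

        def choose_policy(attn_profile, special_mask, local_mask, recovery_target=0.95):
            # Pick the cheapest caching policy whose kept positions recover at least
            # `recovery_target` of the attention mass observed during prompt encoding.
            full_mask = np.ones_like(special_mask, dtype=bool)
            candidates = [                      # ordered roughly cheapest -> most expensive
                ("special_tokens_only", special_mask),
                ("local_window", local_mask),
                ("special_plus_local", special_mask | local_mask),
                ("full_cache", full_mask),
            ]
            for name, mask in candidates:
                if attention_mass_kept(attn_profile, mask) >= recovery_target:
                    return name, mask
            return "full_cache", full_mask

        profile = np.array([0.35, 0.01, 0.02, 0.02, 0.25, 0.35])   # attention mass per position
        special = np.array([True, False, False, False, False, False])
        local   = np.array([False, False, False, False, True, True])
        print(choose_policy(profile, special, local)[0])           # -> "special_plus_local"

    Heads whose attention concentrates on special tokens or a recent window get a small cache; only heads with genuinely broad attention keep the full one.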

    The adaptive KV cache compression introduced by the researchers reduces the memory footprint of generative inference for LLMs. In this setting, generative model inference involves two steps (a minimal sketch of both phases follows the list):

    Prompt Encoding: when an autoregressive transformer-based LLM generates the i-th token, the attention module needs contextual information from all of the preceding i-1 tokens, whose keys and values are computed during prompt encoding and stored in the KV cache.

    Token Generation: once prompt encoding is complete, the LLM generates the output token by token; at each step, the new token(s) produced in the previous step are encoded by the LLM and their keys and values are appended to the cache.
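
    The sketch below makes the two phases and the growing KV cache explicit. Here `model` and its `past_kv` forward signature are placeholders for illustration, not a specific library API.

        import torch

        @torch.no_grad()
        def generate(model, prompt_ids, max_new_tokens):
            # Phase 1: prompt encoding -- one forward pass over all prompt tokens
            # builds the KV cache (keys/values for every layer and head).
            logits, kv_cache = model(prompt_ids, past_kv=None)
            next_id = logits[:, -1].argmax(dim=-1, keepdim=True)
            output = [next_id]

            # Phase 2: token generation -- each step encodes only the newest token
            # and appends its keys/values to the cache, so the cache grows with length.
            for _ in range(max_new_tokens - 1):
                logits, kv_cache = model(next_id, past_kv=kv_cache)
                next_id = logits[:, -1].argmax(dim=-1, keepdim=True)
                output.append(next_id)
            return torch.cat(output, dim=1)

    FastGen's adaptive compression targets this cache, which is what persists and grows across generation steps.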

    For 30B models, FastGen outperforms all non-adaptive KV compression methods, and its KV cache reduction ratio grows with model size while leaving model quality unaffected. For example, FastGen achieves a 44.9% pruned ratio on Llama 1-65B, compared to a 16.9% pruned ratio on Llama 1-7B, while maintaining a 45% win rate. A sensitivity analysis over different hyper-parameter choices further shows that, since the win rate stays at 45%, changing the hyper-parameters has no visible impact on generation quality.
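
    Read literally, and assuming the cache shrinks in proportion to the pruned fraction (a simplifying assumption of ours, not a figure from the paper), those ratios translate into rough memory savings as follows.

        # Rough effect of the reported pruned ratios on KV cache memory,
        # assuming memory scales linearly with the number of retained entries.
        for model, pruned_ratio in [("Llama 1-7B", 0.169), ("Llama 1-65B", 0.449)]:
            print(f"{model}: keeps ~{(1 - pruned_ratio) * 100:.1f}% of the full KV cache")
        # Llama 1-7B:  keeps ~83.1% of the full KV cache
        # Llama 1-65B: keeps ~55.1% of the full KV cache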

    In conclusion, researchers from the University of Illinois Urbana-Champaign and Microsoft proposed FastGen, a new technique that enhances LLM inference efficiency with no visible loss in quality, using lightweight model profiling and adaptive key-value caching. The adaptive KV cache compression constructed by FastGen reduces the memory footprint of generative inference for LLMs. Future work includes integrating FastGen with other model compression approaches, such as quantization, distillation, and grouped-query attention.

    Check out the Paper. All credit for this research goes to the researchers of this project.

