
    Neural Magic Releases LLM Compressor: A Novel Library to Compress LLMs for Faster Inference with vLLM

    August 16, 2024

Neural Magic has released LLM Compressor, a library for large language model optimization that enables significantly faster inference through advanced model compression. The tool is an important building block in Neural Magic's effort to bring high-performance open-source solutions to the deep learning community, particularly within the vLLM framework.


LLM Compressor addresses the previously fragmented landscape of model compression tools, in which users had to juggle multiple bespoke libraries such as AutoGPTQ, AutoAWQ, and AutoFP8 to apply particular quantization and compression algorithms. LLM Compressor folds this functionality into a single library that makes it easy to apply state-of-the-art compression algorithms such as GPTQ, SmoothQuant, and SparseGPT. These algorithms produce compressed models with reduced inference latency while maintaining high accuracy, which is critical for production deployments.
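To make this concrete, here is a minimal sketch of a one-shot compression run that chains SmoothQuant and GPTQ into a single "recipe", modeled on the library's published quickstart. The model name, dataset, and exact module paths and argument names are assumptions that may differ across llmcompressor releases.

```python
# Sketch of a one-shot compression run with llmcompressor, based on the
# project's published quickstart; module paths and argument names may
# differ between releases.
from llmcompressor.modifiers.quantization import GPTQModifier
from llmcompressor.modifiers.smoothquant import SmoothQuantModifier
from llmcompressor.transformers import oneshot

# A "recipe" chains compression algorithms: SmoothQuant first migrates
# activation outliers into the weights, then GPTQ quantizes weights and
# activations to 8-bit (W8A8).
recipe = [
    SmoothQuantModifier(smoothing_strength=0.8),
    GPTQModifier(scheme="W8A8", targets="Linear", ignore=["lm_head"]),
]

# One-shot (post-training) compression using a small calibration dataset.
oneshot(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",  # example model (assumption)
    dataset="open_platypus",
    recipe=recipe,
    output_dir="Meta-Llama-3.1-8B-Instruct-W8A8",
    max_seq_length=2048,
    num_calibration_samples=512,
)
```

The recipe abstraction is what replaces the old one-library-per-algorithm workflow: swapping GPTQ for another scheme is a one-line change rather than a new dependency.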

A second key technical advance in LLM Compressor is support for both activation and weight quantization. Activation quantization in particular is what allows the INT8 and FP8 tensor cores on recent NVIDIA GPU architectures, such as Ada Lovelace and Hopper, to be fully utilized. This matters for compute-bound workloads, where lower-precision arithmetic units relieve the computational bottleneck. By quantizing both activations and weights, LLM Compressor can deliver up to a twofold performance increase for inference, particularly under heavy server load. Benchmarks on Llama 3.1 70B bear this out: the quantized model running on two GPUs achieves latency very close to that of the unquantized version running on four.
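For the FP8 path targeting Hopper and Ada tensor cores, a sketch along these lines should apply, again following the shape of the library's examples; the `FP8_DYNAMIC` scheme name and argument set are assumptions to verify against your installed version.

```python
# Sketch of FP8 weight-and-activation quantization for Hopper/Ada tensor
# cores; scheme names and arguments are assumptions that may vary by version.
from llmcompressor.modifiers.quantization import QuantizationModifier
from llmcompressor.transformers import oneshot

# FP8_DYNAMIC quantizes weights statically and activations dynamically
# per token, so no calibration dataset is needed.
recipe = QuantizationModifier(
    targets="Linear", scheme="FP8_DYNAMIC", ignore=["lm_head"]
)

oneshot(
    model="meta-llama/Meta-Llama-3.1-70B-Instruct",  # example model (assumption)
    recipe=recipe,
    output_dir="Meta-Llama-3.1-70B-Instruct-FP8",
)
```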


Beyond quantization, LLM Compressor supports state-of-the-art 2:4 structured weight pruning with SparseGPT, which selectively removes redundant parameters to cut the model's size by 50% with minimal accuracy loss. Combined with quantization, this pruning not only accelerates inference but also shrinks the memory footprint, enabling LLM deployment on resource-constrained hardware.
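A pruning run looks much like a quantization run, with SparseGPT supplied as the recipe. The sketch below follows the project's examples; the module path (`modifiers.obcq`) and argument names are assumptions to check against your version.

```python
# Sketch of 2:4 structured pruning with SparseGPT via llmcompressor;
# module path and arguments are assumptions based on the project's examples.
from llmcompressor.modifiers.obcq import SparseGPTModifier
from llmcompressor.transformers import oneshot

# sparsity=0.5 with a "2:4" mask keeps 2 of every 4 consecutive weights,
# the pattern NVIDIA sparse tensor cores can accelerate.
recipe = SparseGPTModifier(
    sparsity=0.5,
    mask_structure="2:4",
    targets=["Linear"],
    ignore=["lm_head"],
)

oneshot(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",  # example model (assumption)
    dataset="open_platypus",
    recipe=recipe,
    output_dir="Meta-Llama-3.1-8B-Instruct-2of4",
    num_calibration_samples=512,
)
```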

LLM Compressor is designed to integrate cleanly with the open-source ecosystem, particularly the Hugging Face model hub, so compressed models can be loaded and run in vLLM without friction. The tool also supports a variety of quantization schemes with fine-grained control, such as per-tensor or per-channel quantization for weights and per-tensor or per-token quantization for activations. This flexibility lets practitioners tune the trade-off between performance and accuracy to the demands of different models and deployment scenarios.
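Serving a compressed checkpoint then uses vLLM's standard offline inference API; the model directory below is the hypothetical output of the W8A8 sketch above.

```python
# Loading a compressed checkpoint in vLLM via its standard LLM API;
# the model directory is the (assumed) output of the compression step above.
from vllm import LLM, SamplingParams

llm = LLM(model="Meta-Llama-3.1-8B-Instruct-W8A8")
params = SamplingParams(temperature=0.7, max_tokens=128)

outputs = llm.generate(["What does 2:4 structured sparsity mean?"], params)
print(outputs[0].outputs[0].text)
```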


Technically, LLM Compressor is built to be extensible across model architectures. The project has an ambitious roadmap, including support for mixture-of-experts (MoE) models, vision-language models, and non-NVIDIA hardware platforms. Planned work also covers advanced quantization techniques such as AWQ and tools for creating non-uniform quantization schemes, both expected to extend model efficiency further.

In conclusion, LLM Compressor is an important tool for researchers and practitioners alike in optimizing LLMs for production deployment. Open source and equipped with state-of-the-art features, it makes it easier to compress models and obtain substantial performance improvements without compromising model integrity. As AI continues to scale, tools like LLM Compressor will play a central role in deploying large models efficiently across diverse hardware environments, making them accessible for many more applications.

Check out the GitHub Page and Details. All credit for this research goes to the researchers of this project.

    The post Neural Magic Releases LLM Compressor: A Novel Library to Compress LLMs for Faster Inference with vLLM appeared first on MarkTechPost.

