Close Menu
    DevStackTipsDevStackTips
    • Home
    • News & Updates
      1. Tech & Work
      2. View All

      The Case For Minimal WordPress Setups: A Contrarian View On Theme Frameworks

      May 19, 2025

      Sunshine And March Vibes (2025 Wallpapers Edition)

      May 19, 2025

      How To Fix Largest Contentful Paint Issues With Subpart Analysis

      May 19, 2025

      How To Prevent WordPress SQL Injection Attacks

      May 19, 2025

      My latest hands-on could be the best value AI laptop of the summer, but I still have questions

      May 19, 2025

      DOOM: The Dark Ages had the lowest Steam launch numbers in series history — Is it suffering from the ‘Game Pass Effect’?

      May 19, 2025

      Microsoft won’t be left exposed if something “catastrophic” happens to OpenAI — but may still be 3 to 6 months behind ChatGPT

      May 19, 2025

      Microsoft Copilot gets OpenAI’s GPT-4o image generation support — but maybe a day late and a dollar short for the hype?

      May 19, 2025
    • Development
      1. Algorithms & Data Structures
      2. Artificial Intelligence
      3. Back-End Development
      4. Databases
      5. Front-End Development
      6. Libraries & Frameworks
      7. Machine Learning
      8. Security
      9. Software Engineering
      10. Tools & IDEs
      11. Web Design
      12. Web Development
      13. Web Security
      14. Programming Languages
        • PHP
        • JavaScript
      Featured

      ES6: Set Vs Array- What and When?

      May 19, 2025
      Recent

      ES6: Set Vs Array- What and When?

      May 19, 2025

      Transform JSON into Typed Collections with Laravel’s AsCollection::of()

      May 19, 2025

      Deployer

      May 19, 2025
    • Operating Systems
      1. Windows
      2. Linux
      3. macOS
      Featured

      My latest hands-on could be the best value AI laptop of the summer, but I still have questions

      May 19, 2025
      Recent

      My latest hands-on could be the best value AI laptop of the summer, but I still have questions

      May 19, 2025

      DOOM: The Dark Ages had the lowest Steam launch numbers in series history — Is it suffering from the ‘Game Pass Effect’?

      May 19, 2025

      Microsoft won’t be left exposed if something “catastrophic” happens to OpenAI — but may still be 3 to 6 months behind ChatGPT

      May 19, 2025
    • Learning Resources
      • Books
      • Cheatsheets
      • Tutorials & Guides
    Home»Development»Neural Magic Releases LLM Compressor: A Novel Library to Compress LLMs for Faster Inference with vLLM

    Neural Magic Releases LLM Compressor: A Novel Library to Compress LLMs for Faster Inference with vLLM

    August 16, 2024

    Neural Magic has released the LLM Compressor, a state-of-the-art tool for large language model optimization that enables far quicker inference through much more advanced model compression. Hence, the tool is an important building block in Neural Magic’s pursuit of making high-performance open-source solutions available to the deep learning community, especially inside the vLLM framework.

    Image Source

    LLM Compressor reduces the difficulties that arise from the previously fragmented landscape of model compression tools, wherein users had to develop multiple bespoke libraries similar to AutoGPTQ, AutoAWQ, and AutoFP8 to apply certain quantization and compression algorithms. Such fragmented tools are folded into one library by LLM Compressor to easily apply state-of-the-art compression algorithms like GPTQ, SmoothQuant, and SparseGPT. These algorithms are implemented to create compressed models that offer reduced inference latency and maintain high levels of accuracy, which is critical for the model to be in production environments.

    The second key technical advancement the LLM Compressor brings is activation and weight quantization support. In particular, activation quantization is important to ensure that INT8 and FP8 tensor cores are utilized. These are optimized for high-performance computing on the new GPU architectures from NVIDIA, such as the Ada Lovelace and Hopper architectures. This is an important capability in accelerating compute-bound workloads where the computational bottleneck is eased by using lower-precision arithmetic units. It means that, by quantizing activations and weights, the LLM Compressor allows for up to a twofold increase in performance for inference tasks, mainly under high server loads. This is attested by large models like Llama 3.1 70B, which proves that using the LLM Compressor, the model achieves latency performance very close to that of an unquantized version running on four GPUs with just two.

    Image Source

    Besides activation quantization, the LLM Compressor supports state-of-the-art structured sparsity, 2:4, weight pruning with SparseGPT. This weight pruning removes redundant parameters selectively to reduce the loss in accuracy by dropping 50% of the model’s size. In addition to accelerating inference, this quantization-pruning combination minimizes the memory footprint and enables deployment on resource-constrained hardware for LLMs.

    The LLM Compressor was designed to integrate easily into any open-source ecosystem, particularly the Hugging Face model hub, via the painless loading and running of compressed models within vLLM. Further, the tool extends this by supporting a variety of quantization schemes, including fine-grained control over quantization, like per-tensor or per-channel on weights and per-tensor or per-token quantization on activation. This flexibility in the quantization strategy will allow very fine tuning concerning the demands on performance and accuracy from different models and deployment scenarios.

    Image Source

    Technically, the LLM Compressor is designed to work with various model architectures with extensibility. It has an aggressive roadmap for the tool, including extending support to MoE models, vision-language models, and non-NVIDIA hardware platforms. Other areas in the roadmap that are due for development include advanced quantization techniques such as AWQ and tools for creating non-uniform quantization schemes; those are expected to extend model efficiency further.

    In conclusion, the LLM Compressor thus becomes an important tool for researchers and practitioners alike in optimizing LLMs for deployment to production. It is open-source and has state-of-the-art features, making it easier to compress models and obtain heavy performance improvements without affecting the integrity of the models. The LLM Compressor and similar tools will play a very important role shortly when AI continues scaling in efficiently deploying large models on diverse hardware environments, making them more accessible for application in many other areas.

    Check out the GitHub Page and Details. All credit for this research goes to the researchers of this project. Also, don’t forget to follow us on Twitter and join our Telegram Channel and LinkedIn Group. If you like our work, you will love our newsletter..

    Don’t Forget to join our 48k+ ML SubReddit

    Find Upcoming AI Webinars here

    Arcee AI Introduces Arcee Swarm: A Groundbreaking Mixture of Agents MoA Architecture Inspired by the Cooperative Intelligence Found in Nature Itself

    The post Neural Magic Releases LLM Compressor: A Novel Library to Compress LLMs for Faster Inference with vLLM appeared first on MarkTechPost.

    Source: Read More 

    Facebook Twitter Reddit Email Copy Link
    Previous ArticleCSSWG Minutes Telecon (2024-08-14)
    Next Article Nvidia AI Released Llama-Minitron 3.1 4B: A New Language Model Built by Pruning and Distilling Llama 3.1 8B

    Related Posts

    Development

    February 2025 Baseline monthly digest

    May 19, 2025
    Artificial Intelligence

    Markus Buehler receives 2025 Washington Award

    May 19, 2025
    Leave A Reply Cancel Reply

    Continue Reading

    CVE-2025-4534 – SunGrow Logger1000 Remote Weak Password Requirements Vulnerability

    Common Vulnerabilities and Exposures (CVEs)

    Rilasciati GParted 1.7 e GParted Live 1.7 con il Supporto Sperimentale per Bcachefs e Altri Aggiornamenti

    Linux

    Custom Website Design vs DIY: Which is Right for Your Business?

    Development

    Cyber insurance as part of the cyber threat mitigation strategy

    Development

    Highlights

    Linux

    SSH into Raspberry Pi from Outside Home Network Using Tailscale

    February 12, 2025

    Earlier, I shared how you can use Cloudflare Tunnels to access Raspberry Pi outside your…

    Fine tune a generative AI application for Amazon Bedrock using Amazon SageMaker Pipeline decorators

    August 22, 2024

    Microsoft previews Spanish voice features for its Copilot Voice AI assistant

    April 9, 2025

    Bookspotz: The Real Legends’ behind the idea!

    February 3, 2025
    © DevStackTips 2025. All rights reserved.
    • Contact
    • Privacy Policy

    Type above and press Enter to search. Press Esc to cancel.