
    Researchers from Cerebras & Neural Magic Introduce Sparse Llama: The First Production LLM based on Llama at 70% Sparsity

    May 18, 2024

    Natural Language Processing (NLP) is a cutting-edge field that enables machines to understand, interpret, and generate human language. It has applications in domains such as language translation, text summarization, sentiment analysis, and conversational agents. Large language models (LLMs) have significantly advanced these applications by leveraging vast amounts of data to perform tasks with high accuracy, often approaching human performance.

    A primary challenge in NLP today is the enormous computational and energy cost of training and deploying these LLMs. Their sheer size makes them expensive and less accessible to a broader audience, and the resulting compute and energy demands restrict where they can be used, emphasizing the need to reduce the computational footprint without compromising accuracy. Addressing this challenge is crucial for making these powerful tools more widely available and sustainable.

    Various methods have been employed to mitigate these challenges and reduce LLMs’ size and computational requirements. Quantization is one technique that reduces the number of bits required to represent each model parameter, while pruning involves removing less important weights to streamline the model. However, both methods face significant hurdles in maintaining high accuracy, especially for complex tasks. Current techniques often struggle to achieve meaningful compression ratios without damaging model performance, particularly at high sparsity levels.
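
    To make the two techniques concrete, here is a minimal PyTorch sketch of magnitude pruning and dynamic INT8 quantization applied to a single linear layer. It is only illustrative: the paper's pipeline uses SparseGPT rather than plain magnitude pruning, and the layer size here is an arbitrary stand-in.

```python
import torch

# A toy linear layer standing in for one LLM weight matrix (size is arbitrary).
layer = torch.nn.Linear(1024, 1024, bias=False)

# --- Pruning: zero out the 50% of weights with the smallest magnitude ---
w = layer.weight.data
k = int(0.5 * w.numel())
threshold = w.abs().flatten().kthvalue(k).values
layer.weight.data = w * (w.abs() > threshold).float()

# --- Quantization: represent the surviving weights in 8 bits instead of 32 ---
quantized = torch.ao.quantization.quantize_dynamic(
    torch.nn.Sequential(layer), {torch.nn.Linear}, dtype=torch.qint8
)

sparsity = (layer.weight.data == 0).float().mean().item()
print(f"achieved sparsity: {sparsity:.1%}")
```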

    Researchers from Neural Magic, Cerebras Systems, and IST Austria have introduced a novel approach to create sparse foundational versions of large language models. They specifically targeted the LLaMA-2 7B model, aiming to combine the SparseGPT pruning method with sparse pretraining techniques. This innovative method seeks to achieve high sparsity levels while preserving or enhancing the model’s accuracy. The researchers’ approach involves initially pruning the model to 50% sparsity, followed by further iterative training and pruning steps to reach 70% sparsity. 
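
    The 50%-then-70% schedule can be pictured as a simple loop of "prune, then retrain with the masks fixed". The sketch below uses plain magnitude pruning on a tiny stand-in model purely to show the shape of that loop; SparseGPT and the actual sparse-pretraining phase are far more involved.

```python
import torch

def prune_to(model: torch.nn.Module, sparsity: float) -> dict:
    """Placeholder for SparseGPT: zero the smallest-magnitude weights in each
    Linear layer and return the resulting sparsity masks."""
    masks = {}
    for name, module in model.named_modules():
        if isinstance(module, torch.nn.Linear):
            w = module.weight.data
            threshold = w.abs().flatten().kthvalue(int(sparsity * w.numel())).values
            masks[name] = (w.abs() > threshold).float()
            module.weight.data = w * masks[name]
    return masks

# Tiny stand-in for the LLaMA-2 7B model so the loop runs end to end.
model = torch.nn.Sequential(
    torch.nn.Linear(256, 256), torch.nn.ReLU(), torch.nn.Linear(256, 256)
)

# Iterative schedule: prune to 50%, (sparse-)train, then prune further to 70%.
for target in (0.5, 0.7):
    masks = prune_to(model, target)
    # ... sparse pretraining would run here, with masked weights held at zero ...
    zeros = sum((m == 0).sum().item() for m in masks.values())
    total = sum(m.numel() for m in masks.values())
    print(f"target {target:.0%} -> {zeros / total:.2%} of weights pruned")
```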

    The method begins with sparse pretraining on subsets of high-quality datasets such as SlimPajama and The Stack. The sparse pretraining process includes fine-tuning with per-layer distillation, ensuring the model retains high accuracy across various complex tasks, including chat, code generation, and instruction following. This detailed process involves training the 50% sparse model until convergence and then pruning it further to achieve the 70% target. The weights are pruned and frozen, and sparsity masks are enforced during training to maintain the desired sparsity levels. This iterative process is crucial for maintaining high recovery levels after fine-tuning.
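
    Sketched below is what one such training step might look like: a distillation loss against the dense teacher, followed by re-applying the frozen masks so that pruned weights stay at zero. The paper distills per layer (on hidden states); for brevity this sketch distills only the final output distribution, so treat it as an outline of the idea rather than the authors' recipe.

```python
import torch
import torch.nn.functional as F

def sparse_distillation_step(student, teacher, masks, inputs, optimizer):
    """One sparse-pretraining step: distill from the dense teacher, then
    re-apply the frozen sparsity masks after the optimizer update."""
    with torch.no_grad():
        teacher_logits = teacher(inputs)
    student_logits = student(inputs)

    # Distillation loss: match the dense teacher's output distribution.
    loss = F.kl_div(
        F.log_softmax(student_logits, dim=-1),
        F.softmax(teacher_logits, dim=-1),
        reduction="batchmean",
    )

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # The update can nudge pruned weights away from zero, so the masks
    # (frozen at pruning time) are re-applied after every step.
    with torch.no_grad():
        for name, module in student.named_modules():
            if name in masks:
                module.weight.mul_(masks[name])
    return loss.item()
```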

    The sparse models demonstrated the ability to achieve up to 70% sparsity while fully recovering accuracy for fine-tuning tasks. Training acceleration on Cerebras CS-3 chips closely matched theoretical scaling, showcasing the efficiency of the approach. Inference speeds increased significantly, with improvements of up to 3x on CPUs using Neural Magic’s DeepSparse engine and 1.7x on GPUs using the nm-vllm engine. Additionally, the combination of sparsity and quantization resulted in total speedups on CPUs reaching up to 8.6x, highlighting the method’s efficiency and effectiveness.
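
    The CPU speedups come from kernels that skip the zeroed weights entirely. The toy comparison below (NumPy/SciPy, not the DeepSparse engine) shows the principle: a sparse matrix-vector product only stores and multiplies the roughly 30% of entries that survive 70% pruning.

```python
import numpy as np
from scipy.sparse import csr_matrix

rng = np.random.default_rng(0)

# A 70%-sparse weight matrix: only ~30% of entries are nonzero.
w = rng.standard_normal((2048, 2048))
w[rng.random(w.shape) < 0.7] = 0.0
w_sparse = csr_matrix(w)

x = rng.standard_normal(2048)

# Same result either way, but the sparse kernel only touches stored nonzeros,
# i.e. roughly 30% of the multiply-accumulates of the dense product.
assert np.allclose(w @ x, w_sparse @ x)
print(f"stored nonzeros: {w_sparse.nnz / w.size:.1%} of the dense matrix")
```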

    The study’s results underscore the potential of combining sparsity with quantization to achieve dramatic speedups and performance gains. The sparse pretraining methodology proved particularly effective, demonstrating high recovery at up to 70% sparsity levels. The integration of Cerebras’s CS-3 AI accelerator for sparse pretraining further highlighted the advantages of this approach, enabling near-ideal speedups and significantly reducing computational requirements.

    In conclusion, this research successfully addresses the challenge of reducing the computational demands of LLMs while maintaining their performance. The innovative sparse pretraining and deployment techniques introduced by the Neural Magic, Cerebras Systems, and IST Austria researchers offer a promising solution to the problem. This approach not only enhances the efficiency and accessibility of NLP models but also sets the stage for future advancements in the field.

    Check out the Paper and Model. All credit for this research goes to the researchers of this project.

    The post Researchers from Cerebras & Neural Magic Introduce Sparse Llama: The First Production LLM based on Llama at 70% Sparsity appeared first on MarkTechPost.
