
    Researchers from Cerebras & Neural Magic Introduce Sparse Llama: The First Production LLM based on Llama at 70% Sparsity

    May 18, 2024

    Natural Language Processing (NLP) is a cutting-edge field that enables machines to understand, interpret, and generate human language. It has applications in domains such as language translation, text summarization, sentiment analysis, and conversational agents. Large language models (LLMs) have significantly advanced these applications by leveraging vast amounts of data to perform tasks with high accuracy, often approaching human performance.

    A primary challenge in NLP today is the enormous computational and energy cost of training and deploying these LLMs. Their sheer size makes them expensive and less accessible to a broader audience, and the resulting compute and energy demands restrict where they can be used, emphasizing the need to reduce the computational footprint without compromising accuracy. Addressing this challenge is crucial for making these powerful tools more widely available and sustainable.

    Various methods have been employed to mitigate these challenges and reduce LLMs’ size and computational requirements. Quantization is one technique that reduces the number of bits required to represent each model parameter, while pruning involves removing less important weights to streamline the model. However, both methods face significant hurdles in maintaining high accuracy, especially for complex tasks. Current techniques often struggle to achieve meaningful compression ratios without damaging model performance, particularly at high sparsity levels.
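
    To make the two techniques concrete, here is a minimal PyTorch sketch of magnitude pruning and dynamic INT8 quantization applied to a single linear layer. It is only illustrative: the paper's pipeline uses SparseGPT rather than plain magnitude pruning, and the layer size here is an arbitrary stand-in.

```python
import torch

# A toy linear layer standing in for one LLM weight matrix (size is arbitrary).
layer = torch.nn.Linear(1024, 1024, bias=False)

# --- Pruning: zero out the 50% of weights with the smallest magnitude ---
w = layer.weight.data
k = int(0.5 * w.numel())
threshold = w.abs().flatten().kthvalue(k).values
layer.weight.data = w * (w.abs() > threshold).float()

# --- Quantization: represent the surviving weights in 8 bits instead of 32 ---
quantized = torch.ao.quantization.quantize_dynamic(
    torch.nn.Sequential(layer), {torch.nn.Linear}, dtype=torch.qint8
)

sparsity = (layer.weight.data == 0).float().mean().item()
print(f"achieved sparsity: {sparsity:.1%}")
```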

    Researchers from Neural Magic, Cerebras Systems, and IST Austria have introduced a novel approach to create sparse foundational versions of large language models. They specifically targeted the LLaMA-2 7B model, aiming to combine the SparseGPT pruning method with sparse pretraining techniques. This innovative method seeks to achieve high sparsity levels while preserving or enhancing the model’s accuracy. The researchers’ approach involves initially pruning the model to 50% sparsity, followed by further iterative training and pruning steps to reach 70% sparsity. 
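
    The 50%-then-70% schedule can be pictured as a simple loop of "prune, then retrain with the masks fixed". The sketch below uses plain magnitude pruning on a tiny stand-in model purely to show the shape of that loop; SparseGPT and the actual sparse-pretraining phase are far more involved.

```python
import torch

def prune_to(model: torch.nn.Module, sparsity: float) -> dict:
    """Placeholder for SparseGPT: zero the smallest-magnitude weights in each
    Linear layer and return the resulting sparsity masks."""
    masks = {}
    for name, module in model.named_modules():
        if isinstance(module, torch.nn.Linear):
            w = module.weight.data
            threshold = w.abs().flatten().kthvalue(int(sparsity * w.numel())).values
            masks[name] = (w.abs() > threshold).float()
            module.weight.data = w * masks[name]
    return masks

# Tiny stand-in for the LLaMA-2 7B model so the loop runs end to end.
model = torch.nn.Sequential(
    torch.nn.Linear(256, 256), torch.nn.ReLU(), torch.nn.Linear(256, 256)
)

# Iterative schedule: prune to 50%, (sparse-)train, then prune further to 70%.
for target in (0.5, 0.7):
    masks = prune_to(model, target)
    # ... sparse pretraining would run here, with masked weights held at zero ...
    zeros = sum((m == 0).sum().item() for m in masks.values())
    total = sum(m.numel() for m in masks.values())
    print(f"target {target:.0%} -> {zeros / total:.2%} of weights pruned")
```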

    The method begins with sparse pretraining on subsets of high-quality datasets such as SlimPajama and The Stack. The sparse pretraining process includes fine-tuning with per-layer distillation, ensuring the model retains high accuracy across various complex tasks, including chat, code generation, and instruction following. This detailed process involves training the 50% sparse model until convergence and then pruning it further to achieve the 70% target. The weights are pruned and frozen, and sparsity masks are enforced during training to maintain the desired sparsity levels. This iterative process is crucial for maintaining high recovery levels after fine-tuning.
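
    Sketched below is what one such training step might look like: a distillation loss against the dense teacher, followed by re-applying the frozen masks so that pruned weights stay at zero. The paper distills per layer (on hidden states); for brevity this sketch distills only the final output distribution, so treat it as an outline of the idea rather than the authors' recipe.

```python
import torch
import torch.nn.functional as F

def sparse_distillation_step(student, teacher, masks, inputs, optimizer):
    """One sparse-pretraining step: distill from the dense teacher, then
    re-apply the frozen sparsity masks after the optimizer update."""
    with torch.no_grad():
        teacher_logits = teacher(inputs)
    student_logits = student(inputs)

    # Distillation loss: match the dense teacher's output distribution.
    loss = F.kl_div(
        F.log_softmax(student_logits, dim=-1),
        F.softmax(teacher_logits, dim=-1),
        reduction="batchmean",
    )

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # The update can nudge pruned weights away from zero, so the masks
    # (frozen at pruning time) are re-applied after every step.
    with torch.no_grad():
        for name, module in student.named_modules():
            if name in masks:
                module.weight.mul_(masks[name])
    return loss.item()
```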

    The sparse models demonstrated the ability to achieve up to 70% sparsity while fully recovering accuracy for fine-tuning tasks. Training acceleration on Cerebras CS-3 chips closely matched theoretical scaling, showcasing the efficiency of the approach. Inference speeds increased significantly, with improvements of up to 3x on CPUs using Neural Magic’s DeepSparse engine and 1.7x on GPUs using the nm-vllm engine. Additionally, the combination of sparsity and quantization resulted in total speedups on CPUs reaching up to 8.6x, highlighting the method’s efficiency and effectiveness.
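
    The CPU speedups come from kernels that skip the zeroed weights entirely. The toy comparison below (NumPy/SciPy, not the DeepSparse engine) shows the principle: a sparse matrix-vector product only stores and multiplies the roughly 30% of entries that survive 70% pruning.

```python
import numpy as np
from scipy.sparse import csr_matrix

rng = np.random.default_rng(0)

# A 70%-sparse weight matrix: only ~30% of entries are nonzero.
w = rng.standard_normal((2048, 2048))
w[rng.random(w.shape) < 0.7] = 0.0
w_sparse = csr_matrix(w)

x = rng.standard_normal(2048)

# Same result either way, but the sparse kernel only touches stored nonzeros,
# i.e. roughly 30% of the multiply-accumulates of the dense product.
assert np.allclose(w @ x, w_sparse @ x)
print(f"stored nonzeros: {w_sparse.nnz / w.size:.1%} of the dense matrix")
```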

    The study’s results underscore the potential of combining sparsity with quantization to achieve dramatic speedups and performance gains. The sparse pretraining methodology proved particularly effective, demonstrating high recovery at up to 70% sparsity levels. The integration of Cerebras’s CS-3 AI accelerator for sparse pretraining further highlighted the advantages of this approach, enabling near-ideal speedups and significantly reducing computational requirements.

    In conclusion, this research successfully addresses the challenge of reducing the computational demands of LLMs while maintaining their performance. The innovative sparse pretraining and deployment techniques introduced by the Neural Magic, Cerebras Systems, and IST Austria researchers offer a promising solution to the problem. This approach not only enhances the efficiency and accessibility of NLP models but also sets the stage for future advancements in the field.

    Check out the Paper and Model. All credit for this research goes to the researchers of this project.

    The post Researchers from Cerebras & Neural Magic Introduce Sparse Llama: The First Production LLM based on Llama at 70% Sparsity appeared first on MarkTechPost.
