
This AI Paper Presents a Direct Experimental Comparison between 8B-Parameter Mamba, Mamba-2, Mamba-2-Hybrid, and Transformer Models Trained on Up to 3.5T Tokens

    June 19, 2024

Transformer-based Large Language Models (LLMs) have emerged as the backbone of Natural Language Processing (NLP), showing remarkable performance across a wide variety of tasks. Their success is largely driven by the self-attention mechanism, which enables effective all-to-all communication between the tokens in a sequence. This mechanism, together with its ability to scale smoothly with both model and dataset size, has made Transformers the dominant architecture in NLP research.
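
To make the "all-to-all communication" concrete, below is a minimal sketch of single-head scaled dot-product self-attention in plain NumPy. The function and variable names are illustrative, not taken from the paper or any particular library; it only shows why every token attends to every other token.

```python
import numpy as np

def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention (simplified sketch).

    x:             (seq_len, d_model) token representations
    w_q, w_k, w_v: (d_model, d_head) projection matrices
    """
    q, k, v = x @ w_q, x @ w_k, x @ w_v            # project every token
    scores = q @ k.T / np.sqrt(k.shape[-1])        # (seq_len, seq_len): all-to-all
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True) # softmax over the keys
    return weights @ v                             # each token mixes all values

# Toy usage: 4 tokens, width 8
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
w_q, w_k, w_v = (rng.normal(size=(8, 8)) for _ in range(3))
print(self_attention(x, w_q, w_k, w_v).shape)  # (4, 8)
```

The (seq_len, seq_len) score matrix is the source of the quadratic cost discussed next.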

However, self-attention layers have significant limitations, especially on long sequences. During training, the computational cost of self-attention grows quadratically with sequence length. At inference time, memory demand grows linearly with the number of previous tokens, since a large key-value cache must be kept to hold the state. Numerous attempts have been made to build more efficient attention variants in response to these difficulties, but they generally fall short of the language modeling quality of conventional self-attention.
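
To put rough numbers on that scaling, here is a back-of-the-envelope cost sketch. The dimensions (32 layers, 32 heads of width 128, 2-byte cached values) are assumed for illustration only and are not figures reported in the paper:

```python
def attention_costs(seq_len, n_layers=32, n_heads=32, d_head=128, bytes_per_val=2):
    """Rough, illustrative cost model for self-attention at a given sequence length."""
    d_model = n_heads * d_head
    # Attention score computation touches every pair of tokens -> quadratic in seq_len
    score_flops = n_layers * 2 * seq_len * seq_len * d_model
    # The key-value cache stores K and V for every past token -> linear in seq_len
    kv_cache_bytes = n_layers * 2 * seq_len * d_model * bytes_per_val
    return score_flops, kv_cache_bytes

for length in (4_096, 32_768, 131_072):
    flops, cache = attention_costs(length)
    print(f"seq_len={length:>7}: score FLOPs ~{flops:.2e}, KV cache ~{cache / 1e9:.1f} GB")
```

Doubling the sequence length quadruples the score computation but only doubles the cache size, which is the pressure that efficient-attention variants try to relieve.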

Selective state-space models (SSMs) such as Mamba address some of these fundamental limitations. Whereas Transformers incur quadratic computational complexity in sequence length and, because of the key-value cache, high memory requirements during inference, SSMs avoid both problems by maintaining a fixed-size recurrent state. Recent studies have shown that SSMs can match, and sometimes outperform, Transformers on language modeling tasks, making them a credible alternative.
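
For intuition, the following is a deliberately simplified, schematic sketch of the recurrence behind selective SSMs. It is not Mamba's actual implementation (which uses input-dependent discretization, gating, and a hardware-aware parallel scan); it only illustrates why the per-token state stays a fixed size instead of growing like a key-value cache:

```python
import numpy as np

def selective_ssm_step(h, x_t, A_bar, B_t, C_t):
    """One recurrent step of a (simplified) selective state-space model.

    h:        (d_state,) hidden state carried across tokens
    x_t:      scalar input for this token/channel
    A_bar:    (d_state,) discretized state transition (input-dependent in Mamba)
    B_t, C_t: (d_state,) per-token input/output projections
    """
    h = A_bar * h + B_t * x_t   # constant work per token, regardless of history length
    y_t = C_t @ h               # read out this token's output
    return h, y_t

# Toy scan: the state stays O(d_state) no matter how long the sequence gets
rng = np.random.default_rng(0)
d_state, seq_len = 16, 10
h = np.zeros(d_state)
for _ in range(seq_len):
    A_bar = np.exp(-rng.uniform(0.1, 1.0, d_state))   # stable decay, illustrative values
    B_t, C_t = rng.normal(size=(2, d_state))
    h, y_t = selective_ssm_step(h, rng.normal(), A_bar, B_t, C_t)
print(h.shape)  # (16,) -- a fixed-size state replaces the growing KV cache
```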

Despite these promising results, previous comparisons between SSMs and Transformers have mostly been small-scale trials using models with fewer than 3 billion parameters trained on fewer than 1 trillion tokens. To better understand how these architectures behave at larger scale, a team of researchers has recently performed a direct comparison of 8-billion-parameter Mamba, Mamba-2, and Transformer models, all trained on datasets of up to 3.5 trillion tokens.

The team also trained an 8-billion-parameter hybrid model, called Mamba-2-Hybrid, consisting of 43% Mamba-2, 7% self-attention, and 50% MLP layers. To find out whether Mamba models could compete with Transformers when given more training resources, the team evaluated all of the models across a wide range of natural language tasks. The results showed that on several tasks, the pure SSM models, Mamba and Mamba-2, either matched or outperformed the Transformer.
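
As a quick sanity check on what that split means in layer counts, here is a hypothetical 56-layer plan whose proportions land near the stated percentages; the actual depth and interleaving order used in the paper may differ:

```python
from collections import Counter

# Hypothetical layer counts chosen only so the ratios come out near
# 43% Mamba-2, 7% self-attention, 50% MLP.
layer_plan = ["mamba2"] * 24 + ["attention"] * 4 + ["mlp"] * 28

counts, total = Counter(layer_plan), len(layer_plan)
for kind in ("mamba2", "attention", "mlp"):
    print(f"{kind:>9}: {counts[kind]:2d}/{total} layers ({100 * counts[kind] / total:.0f}%)")
```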

However, the pure SSM models fell short on tasks that demand strong copying or in-context learning abilities, such as five-shot MMLU and Phonebook Lookup, and on tasks requiring substantial long-context reasoning. In contrast, on all 12 standard tasks assessed, the 8-billion-parameter Mamba-2-Hybrid model outperformed the 8-billion-parameter Transformer, with an average improvement of 2.65 points. During inference, the hybrid model was also able to generate tokens up to eight times faster.

To evaluate long-context capabilities further, the team extended their study to variants of the Mamba-2-Hybrid and Transformer models supporting sequence lengths of 16K, 32K, and 128K. Across 23 additional long-context tasks, the hybrid model continued to perform, on average, on par with or better than the Transformer. The team has released the code as part of NVIDIA's Megatron-LM project.

Check out the Paper and Code. All credit for this research goes to the researchers of this project.


The post This AI Paper Presents a Direct Experimental Comparison between 8B-Parameter Mamba, Mamba-2, Mamba-2-Hybrid, and Transformer Models Trained on Up to 3.5T Tokens appeared first on MarkTechPost.
