
This AI Paper Presents a Direct Experimental Comparison between 8B-Parameter Mamba, Mamba-2, Mamba-2-Hybrid, and Transformer Models Trained on Up to 3.5T Tokens

    June 19, 2024

Transformer-based Large Language Models (LLMs) have emerged as the backbone of Natural Language Processing (NLP), showing remarkable performance across a wide variety of tasks. Their success is largely driven by the self-attention mechanism, which enables effective all-to-all communication between the tokens in a sequence. This mechanism, together with its ability to scale smoothly with both model and dataset size, has made Transformers the dominant architecture in NLP research.
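
To make the "all-to-all communication" concrete, below is a minimal sketch of single-head scaled dot-product self-attention in plain NumPy. The function and variable names are illustrative, not taken from the paper or any particular library; it only shows why every token attends to every other token.

```python
import numpy as np

def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention (simplified sketch).

    x:             (seq_len, d_model) token representations
    w_q, w_k, w_v: (d_model, d_head) projection matrices
    """
    q, k, v = x @ w_q, x @ w_k, x @ w_v            # project every token
    scores = q @ k.T / np.sqrt(k.shape[-1])        # (seq_len, seq_len): all-to-all
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True) # softmax over the keys
    return weights @ v                             # each token mixes all values

# Toy usage: 4 tokens, width 8
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
w_q, w_k, w_v = (rng.normal(size=(8, 8)) for _ in range(3))
print(self_attention(x, w_q, w_k, w_v).shape)  # (4, 8)
```

The (seq_len, seq_len) score matrix is the source of the quadratic cost discussed next.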

However, self-attention layers have significant limitations, especially on long sequences. During training, the computational cost of self-attention grows quadratically with sequence length. At inference time, memory demand grows linearly with the number of previous tokens, since a large key-value cache must be kept to hold the state. Numerous attempts have been made to build more efficient attention variants in response to these difficulties, but they generally fall short of the language modeling quality of conventional self-attention.
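
To put rough numbers on that scaling, here is a back-of-the-envelope cost sketch. The dimensions (32 layers, 32 heads of width 128, 2-byte cached values) are assumed for illustration only and are not figures reported in the paper:

```python
def attention_costs(seq_len, n_layers=32, n_heads=32, d_head=128, bytes_per_val=2):
    """Rough, illustrative cost model for self-attention at a given sequence length."""
    d_model = n_heads * d_head
    # Attention score computation touches every pair of tokens -> quadratic in seq_len
    score_flops = n_layers * 2 * seq_len * seq_len * d_model
    # The key-value cache stores K and V for every past token -> linear in seq_len
    kv_cache_bytes = n_layers * 2 * seq_len * d_model * bytes_per_val
    return score_flops, kv_cache_bytes

for length in (4_096, 32_768, 131_072):
    flops, cache = attention_costs(length)
    print(f"seq_len={length:>7}: score FLOPs ~{flops:.2e}, KV cache ~{cache / 1e9:.1f} GB")
```

Doubling the sequence length quadruples the score computation but only doubles the cache size, which is the pressure that efficient-attention variants try to relieve.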

Selective state-space models (SSMs) such as Mamba address some of these fundamental limitations. Whereas Transformers incur quadratic computational complexity in sequence length and, because of the key-value cache, high memory requirements during inference, SSMs avoid both problems by maintaining a fixed-size recurrent state. Recent studies have shown that SSMs can match, and sometimes outperform, Transformers on language modeling tasks, making them a credible alternative.
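
For intuition, the following is a deliberately simplified, schematic sketch of the recurrence behind selective SSMs. It is not Mamba's actual implementation (which uses input-dependent discretization, gating, and a hardware-aware parallel scan); it only illustrates why the per-token state stays a fixed size instead of growing like a key-value cache:

```python
import numpy as np

def selective_ssm_step(h, x_t, A_bar, B_t, C_t):
    """One recurrent step of a (simplified) selective state-space model.

    h:        (d_state,) hidden state carried across tokens
    x_t:      scalar input for this token/channel
    A_bar:    (d_state,) discretized state transition (input-dependent in Mamba)
    B_t, C_t: (d_state,) per-token input/output projections
    """
    h = A_bar * h + B_t * x_t   # constant work per token, regardless of history length
    y_t = C_t @ h               # read out this token's output
    return h, y_t

# Toy scan: the state stays O(d_state) no matter how long the sequence gets
rng = np.random.default_rng(0)
d_state, seq_len = 16, 10
h = np.zeros(d_state)
for _ in range(seq_len):
    A_bar = np.exp(-rng.uniform(0.1, 1.0, d_state))   # stable decay, illustrative values
    B_t, C_t = rng.normal(size=(2, d_state))
    h, y_t = selective_ssm_step(h, rng.normal(), A_bar, B_t, C_t)
print(h.shape)  # (16,) -- a fixed-size state replaces the growing KV cache
```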

Despite these promising results, previous comparisons between SSMs and Transformers have mostly been small-scale trials using models with fewer than 3 billion parameters trained on fewer than 1 trillion tokens. To better understand how these architectures behave at larger scale, a team of researchers has recently performed a direct comparison of 8-billion-parameter Mamba, Mamba-2, and Transformer models, all trained on datasets of up to 3.5 trillion tokens.

The team also trained an 8-billion-parameter hybrid model, called Mamba-2-Hybrid, consisting of 43% Mamba-2, 7% self-attention, and 50% MLP layers. To find out whether Mamba models could compete with Transformers when given more training resources, the team evaluated all of the models across a wide range of natural language tasks. The results showed that on several tasks, the pure SSM models, Mamba and Mamba-2, either matched or outperformed the Transformer.
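
As a quick sanity check on what that split means in layer counts, here is a hypothetical 56-layer plan whose proportions land near the stated percentages; the actual depth and interleaving order used in the paper may differ:

```python
from collections import Counter

# Hypothetical layer counts chosen only so the ratios come out near
# 43% Mamba-2, 7% self-attention, 50% MLP.
layer_plan = ["mamba2"] * 24 + ["attention"] * 4 + ["mlp"] * 28

counts, total = Counter(layer_plan), len(layer_plan)
for kind in ("mamba2", "attention", "mlp"):
    print(f"{kind:>9}: {counts[kind]:2d}/{total} layers ({100 * counts[kind] / total:.0f}%)")
```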

However, the pure SSM models fell short on tasks that demand strong copying or in-context learning abilities, such as five-shot MMLU and Phonebook Lookup, and on tasks requiring substantial long-context reasoning. In contrast, on all 12 standard tasks assessed, the 8-billion-parameter Mamba-2-Hybrid model outperformed the 8-billion-parameter Transformer, with an average improvement of 2.65 points. During inference, the hybrid model was also able to generate tokens up to eight times faster.

To evaluate long-context capabilities further, the team extended their study to variants of the Mamba-2-Hybrid and Transformer models supporting sequence lengths of 16K, 32K, and 128K. Across 23 additional long-context tasks, the hybrid model continued to perform, on average, on par with or better than the Transformer. The team has released the code as part of NVIDIA's Megatron-LM project.

Check out the Paper and Code. All credit for this research goes to the researchers of this project.


The post This AI Paper Presents a Direct Experimental Comparison between 8B-Parameter Mamba, Mamba-2, Mamba-2-Hybrid, and Transformer Models Trained on Up to 3.5T Tokens appeared first on MarkTechPost.
