Audio classification has evolved significantly with the adoption of deep learning models. Initially dominated by Convolutional Neural Networks (CNNs), the field has shifted toward transformer-based architectures, which offer improved performance and a unified approach to a wide range of tasks, particularly those that require extensive contextual understanding or must handle diverse input data types.
The primary challenge in audio classification is the computational complexity of transformers, particularly their self-attention mechanism, which scales quadratically with sequence length. This makes processing long audio sequences inefficient and motivates alternative methods that maintain performance while reducing computational load. Addressing this issue is crucial for building models that can efficiently handle the growing volume and complexity of audio data across applications, from speech recognition to environmental sound classification.
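To make the scaling concrete, here is a minimal, hypothetical sketch of single-head scaled dot-product self-attention in PyTorch (not taken from the AST or AuM codebases): the (N, N) score matrix it materializes is what makes compute and memory grow quadratically with the number of tokens N.

```python
import torch

def naive_self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product attention over a length-N token sequence.

    The (N, N) score matrix is the source of the quadratic cost in N.
    """
    q, k, v = x @ w_q, x @ w_k, x @ w_v            # each (N, d)
    scores = (q @ k.T) / (q.shape[-1] ** 0.5)      # (N, N): quadratic in N
    weights = torch.softmax(scores, dim=-1)        # (N, N)
    return weights @ v                             # (N, d)

# A long audio clip can easily yield thousands of spectrogram patches,
# so the score matrix alone holds millions of entries per head, per layer.
d = 64
x = torch.randn(4096, d)
w_q, w_k, w_v = torch.randn(d, d), torch.randn(d, d), torch.randn(d, d)
out = naive_self_attention(x, w_q, w_k, w_v)       # (4096, 64)
```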
Currently, the most prominent method for audio classification is the Audio Spectrogram Transformer (AST). ASTs utilize self-attention mechanisms to capture the global context in audio data but suffer from high computational costs. State space models (SSMs) have been explored as a potential alternative, offering linear scaling with sequence length. SSMs, such as Mamba, have shown promise in language and vision tasks by replacing self-attention with time-varying parameters to capture global context more efficiently. Despite their success in other domains, SSMs have yet to be widely adopted in audio classification, presenting an opportunity for innovation in this area.
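For contrast, the toy scan below (illustrative only, with made-up matrices A, B, C) runs a time-invariant state-space recurrence in a single pass over the sequence. Real Mamba layers use input-dependent (selective) parameters and hardware-aware parallel scans, but the linear dependence on sequence length is the same.

```python
import torch

def ssm_scan(x, A, B, C):
    """Toy time-invariant state-space recurrence, run as a sequential scan.

        h_t = A h_{t-1} + B x_t
        y_t = C h_t

    One pass over the sequence gives O(N) time in the sequence length N.
    """
    h = torch.zeros(A.shape[0])
    ys = []
    for x_t in x:                      # single linear pass over the tokens
        h = A @ h + B @ x_t
        ys.append(C @ h)
    return torch.stack(ys)             # (N, d_out)

x = torch.randn(4096, 16)              # a long token sequence
A = 0.9 * torch.eye(32)                # toy stable transition matrix
B = 0.1 * torch.randn(32, 16)
C = 0.1 * torch.randn(8, 32)
y = ssm_scan(x, A, B, C)               # (4096, 8)
```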
Researchers from the Korea Advanced Institute of Science and Technology introduced Audio Mamba (AuM), a novel self-attention-free model based on state space models for audio classification. This model processes audio spectrograms efficiently using a bidirectional approach to handle long sequences without the quadratic scaling associated with transformers. The AuM model aims to eliminate the computational burden of self-attention, leveraging SSMs to maintain high performance while improving efficiency. By addressing the inefficiencies of transformers, AuM offers a promising alternative for audio classification tasks.
Audio Mamba’s architecture involves converting input audio waveforms into spectrograms, which are then divided into patches. These patches are transformed into embedding tokens and processed using bidirectional state space models. The model operates in both forward and backward directions, capturing the global context efficiently and maintaining linear time complexity, thus improving processing speed and memory usage compared to ASTs. The architecture incorporates several innovative design choices, such as the strategic placement of a learnable classification token in the middle of the sequence and the use of positional embeddings to enhance the model’s ability to understand the spatial structure of the input data.
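The following is a hypothetical PyTorch sketch of that data flow, not the authors’ implementation: it assumes square 16×16 spectrogram patches, inserts the learnable classification token at the middle of the token sequence, adds positional embeddings, and mixes tokens in both directions. A GRU stands in for the bidirectional SSM (Mamba) blocks purely to keep the example self-contained; all module names and dimensions are illustrative.

```python
import torch
import torch.nn as nn

class AuMStyleSketch(nn.Module):
    """Hypothetical sketch of an Audio-Mamba-style pipeline (not the official code).

    Spectrogram -> non-overlapping patches -> linear patch embedding ->
    learnable class token inserted at the middle of the sequence ->
    positional embeddings -> bidirectional sequence mixing (forward + backward).
    A GRU stands in for the selective-SSM (Mamba) blocks used in the paper.
    """

    def __init__(self, n_mels=128, n_frames=1024, patch=16, dim=192, n_classes=527):
        super().__init__()
        self.patch = patch
        n_patches = (n_mels // patch) * (n_frames // patch)
        self.embed = nn.Linear(patch * patch, dim)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos_emb = nn.Parameter(torch.zeros(1, n_patches + 1, dim))
        self.fwd_mixer = nn.GRU(dim, dim, batch_first=True)  # stand-in: forward SSM
        self.bwd_mixer = nn.GRU(dim, dim, batch_first=True)  # stand-in: backward SSM
        self.head = nn.Linear(dim, n_classes)

    def forward(self, spec):                                  # spec: (B, n_mels, n_frames)
        B, p = spec.shape[0], self.patch
        # Cut the spectrogram into p x p patches and flatten each one.
        patches = spec.unfold(1, p, p).unfold(2, p, p)        # (B, M/p, T/p, p, p)
        patches = patches.reshape(B, -1, p * p)               # (B, N, p*p)
        tokens = self.embed(patches)                          # (B, N, dim)
        # Place the learnable classification token at the middle of the sequence.
        mid = tokens.shape[1] // 2
        cls = self.cls_token.expand(B, -1, -1)
        tokens = torch.cat([tokens[:, :mid], cls, tokens[:, mid:]], dim=1)
        tokens = tokens + self.pos_emb[:, : tokens.shape[1]]
        # Bidirectional mixing: one pass forward, one over the reversed sequence.
        fwd, _ = self.fwd_mixer(tokens)
        bwd, _ = self.bwd_mixer(tokens.flip(1))
        mixed = fwd + bwd.flip(1)
        return self.head(mixed[:, mid])                       # read out at the class token

model = AuMStyleSketch()
logits = model(torch.randn(2, 128, 1024))                     # (2, 527)
```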
Audio Mamba demonstrated competitive performance across benchmarks including AudioSet, VGGSound, and VoxCeleb, achieving comparable or better results than AST and particularly excelling on tasks involving long audio sequences. On the VGGSound dataset, for example, Audio Mamba outperformed AST by more than five percentage points, reaching 42.58% accuracy versus AST’s 37.25%. On AudioSet, AuM achieved a mean average precision (mAP) of 32.43%, surpassing AST’s 29.10%. These results highlight AuM’s ability to deliver strong performance while remaining computationally efficient, making it a robust option for a range of audio classification tasks.
The evaluation also showed that AuM requires significantly less memory and processing time. During training on 20-second audio clips, for instance, AuM consumed roughly as much memory as the smaller AST variant while delivering better performance, and its inference was 1.6 times faster than AST’s at a token count of 4096, demonstrating its efficiency on long sequences. This reduction in computational requirements without a loss in accuracy suggests that AuM is well suited to real-world applications where resource constraints are a critical consideration.
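The paper’s exact numbers depend on the authors’ hardware and implementation, but a rough probe like the hypothetical helper below shows how one might reproduce such a comparison for any token-sequence model (assumed here to take a (batch, tokens, dim) tensor) at a chosen token count such as 4096.

```python
import time
import torch

@torch.no_grad()
def probe_latency(model, n_tokens, dim=192, n_runs=20, device="cpu"):
    """Rough latency (and, on GPU, peak-memory) probe for a token-sequence model.

    Assumes the model accepts a (batch, tokens, dim) tensor; results depend
    entirely on hardware and implementation details.
    """
    model = model.to(device).eval()
    x = torch.randn(1, n_tokens, dim, device=device)
    for _ in range(3):                                  # warm-up runs
        model(x)
    if device == "cuda":
        torch.cuda.reset_peak_memory_stats()
        torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(n_runs):
        model(x)
    if device == "cuda":
        torch.cuda.synchronize()
    latency = (time.perf_counter() - start) / n_runs
    peak_mem = torch.cuda.max_memory_allocated() if device == "cuda" else None
    return latency, peak_mem

# e.g. probe_latency(some_model, n_tokens=4096) for each model under comparison
```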
In summary, the introduction of Audio Mamba marks a significant advance in audio classification by addressing the limitations of self-attention in transformers. The model’s efficiency and competitive performance highlight its potential as a viable alternative for processing long audio sequences. The researchers believe its approach could pave the way for future developments in audio and multimodal learning applications. The ability to handle lengthy audio is increasingly important with the rise of self-supervised multimodal learning and generation that leverage in-the-wild data and automatic speech recognition. AuM could also be employed in self-supervised setups such as Audio Masked Auto Encoders, or in multimodal tasks such as Audio-Visual pretraining and Contrastive Language-Audio Pretraining, contributing further to the advancement of the audio classification field.