
    This AI Paper Introduces BEST-STD (Spoken Term Detection): A Novel Bidirectional Mamba-Enhanced Speech Tokenization Framework for Efficient Spoken Term Detection

    November 27, 2024

    Spoken term detection (STD) is a critical area in speech processing, enabling the identification of specific phrases or terms in large audio archives. This technology is extensively used in voice-based searches, transcription services, and multimedia indexing applications. By facilitating the retrieval of spoken content, STD plays a pivotal role in improving the accessibility and usability of audio data, especially in domains like podcasts, lectures, and broadcast media.

    A significant challenge in spoken term detection is handling out-of-vocabulary (OOV) terms while keeping computational cost manageable. Traditional methods often depend on automatic speech recognition (ASR) systems, which are resource-intensive and prone to errors, particularly for short audio segments or under variable acoustic conditions. These methods also struggle to segment continuous speech accurately, which makes it difficult to identify specific terms without context.

    Existing approaches to STD include ASR-based techniques that use phoneme or grapheme lattices, as well as dynamic time warping (DTW) and acoustic word embeddings for direct audio comparisons. While these methods have their merits, they are limited by speaker variability, computational inefficiency, and difficulty in processing large datasets. Current tools also struggle to generalize across datasets, especially for terms not encountered during training.
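
    To make the DTW baseline concrete, here is a minimal sketch of the classic dynamic time warping recurrence. It is not taken from the paper: the function name `dtw_distance` is hypothetical, and the inputs are plain 1-D number sequences with an absolute-difference cost, whereas real systems compare sequences of acoustic feature vectors.

```python
from math import inf


def dtw_distance(a, b):
    """Cumulative cost of the best monotonic alignment between sequences a and b."""
    n, m = len(a), len(b)
    # cost[i][j] = minimal cumulative cost of aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])  # assumed frame-level distance
            cost[i][j] = local + min(
                cost[i - 1][j],      # a advances, b repeats a frame
                cost[i][j - 1],      # b advances, a repeats a frame
                cost[i - 1][j - 1],  # both advance
            )
    return cost[n][m]


# A stretched copy of the same pattern aligns at zero cost.
print(dtw_distance([1, 2, 3], [1, 2, 2, 3]))
```

    The table has len(a) × len(b) cells, so matching one query against every segment of a large archive this way becomes expensive, which is the inefficiency described above.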

    Researchers from the Indian Institute of Technology Kanpur and imec – Ghent University have introduced a novel speech tokenization framework named BEST-STD. This approach encodes speech into discrete, speaker-agnostic semantic tokens, enabling efficient retrieval with text-based algorithms. By incorporating a bidirectional Mamba encoder, the framework generates highly consistent token sequences across different utterances of the same term. This method eliminates the need for explicit segmentation and handles OOV terms more effectively than previous systems.

    The BEST-STD system uses a bidirectional Mamba encoder, which processes audio input in both forward and backward directions to capture long-range dependencies. The encoder projects audio frames into high-dimensional embeddings, which a vector quantizer then discretizes into token sequences. The model is trained with a self-supervised approach that uses dynamic time warping to align utterances of the same term and create frame-level anchor-positive pairs. This training encourages consistent token representations that are invariant to speaker and acoustic variation. At retrieval time, the system stores the tokenized sequences in an inverted index and finds matches efficiently by comparing token similarity.
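
    The discretization step can be illustrated with a nearest-neighbour codebook lookup. This is only a sketch under assumptions: the article does not describe how the paper's quantizer or its codebook is learned, the name `quantize` is hypothetical, and the 2-D "embeddings" below are toy values chosen by hand.

```python
def quantize(frames, codebook):
    """Map each frame embedding to the index of its nearest codebook vector."""
    tokens = []
    for frame in frames:
        # Squared Euclidean distance from this frame to every codebook entry.
        dists = [
            sum((x - c) ** 2 for x, c in zip(frame, code))
            for code in codebook
        ]
        tokens.append(dists.index(min(dists)))
    return tokens


# Toy codebook with three entries; a real one would be learned during training.
codebook = [(0.0, 0.0), (1.0, 1.0), (4.0, 4.0)]
frames = [(0.1, -0.1), (0.9, 1.2), (3.5, 4.2)]
print(quantize(frames, codebook))  # [0, 1, 2]
```

    Two utterances of the same term whose frame embeddings fall near the same codebook entries produce the same tokens, which is what makes text-style matching possible afterwards.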

    The BEST-STD framework demonstrated superior performance in evaluations conducted on the LibriSpeech and TIMIT datasets. Compared to traditional STD methods and state-of-the-art tokenization models like HuBERT, WavLM, and SpeechTokenizer, BEST-STD achieved significantly higher Jaccard similarity scores for token consistency, with unigram scores reaching 0.84 and bigram scores at 0.78. The system outperformed baselines on spoken content retrieval tasks in mean average precision (MAP) and mean reciprocal rank (MRR). For in-vocabulary terms, BEST-STD achieved MAP scores of 0.86 and MRR scores of 0.91 on the LibriSpeech dataset, while for OOV terms, the scores reached 0.84 and 0.90 respectively. These results underline the system’s ability to effectively generalize across different term types and datasets.
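
    For readers unfamiliar with these metrics, the sketch below shows one common way to compute them. The helper names are hypothetical, and the paper's exact evaluation protocol (for example, how n-gram sets are formed or how ties are ranked) may differ from this simplified version.

```python
def ngrams(tokens, n):
    """Set of contiguous n-grams in a token sequence."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def jaccard(a, b, n=1):
    """Jaccard similarity between the n-gram sets of two token sequences."""
    set_a, set_b = ngrams(a, n), ngrams(b, n)
    if not set_a and not set_b:
        return 1.0
    return len(set_a & set_b) / len(set_a | set_b)


def reciprocal_rank(ranked, relevant):
    """1 / rank of the first relevant item, or 0.0 if none is retrieved."""
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            return 1.0 / rank
    return 0.0


def average_precision(ranked, relevant):
    """Mean of precision@k taken at each rank k where a relevant item appears."""
    hits, total = 0, 0.0
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0


# Two hypothetical tokenizations of the same spoken term.
utt_1, utt_2 = [5, 5, 7, 2], [5, 7, 7, 2]
print(jaccard(utt_1, utt_2, n=1))  # 1.0
print(jaccard(utt_1, utt_2, n=2))  # 0.5
```

    MAP and MRR are then the means of `average_precision` and `reciprocal_rank` over all queries in the evaluation set.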

    Notably, the BEST-STD framework also excelled in retrieval speed and efficiency, benefiting from an inverted index for tokenized sequences. This approach reduced reliance on computationally intensive DTW-based matching, making it scalable for large datasets. The bidirectional Mamba encoder, in particular, proved more effective than transformer-based architectures due to its ability to model fine-grained temporal information critical for spoken term detection.
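
    The idea of retrieving tokenized audio through an inverted index can be sketched as follows. This is an illustrative assumption rather than the paper's implementation: the index here is keyed on token bigrams and ranks utterances by the number of shared bigrams, and the function names and utterance IDs are hypothetical.

```python
from collections import Counter, defaultdict


def bigrams(tokens):
    """Set of adjacent token pairs in a sequence."""
    return set(zip(tokens, tokens[1:]))


def build_index(corpus):
    """Map each token bigram to the set of utterance IDs that contain it."""
    index = defaultdict(set)
    for utt_id, tokens in corpus.items():
        for gram in bigrams(tokens):
            index[gram].add(utt_id)
    return index


def search(index, query_tokens, top_k=5):
    """Rank utterances by how many query bigrams they share."""
    votes = Counter()
    for gram in bigrams(query_tokens):
        for utt_id in index.get(gram, ()):
            votes[utt_id] += 1
    return votes.most_common(top_k)


corpus = {
    "utt_a": [3, 8, 8, 1, 4],
    "utt_b": [9, 2, 6, 6, 7],
    "utt_c": [3, 8, 1, 5],
}
index = build_index(corpus)
print(search(index, [8, 1, 4]))  # [('utt_a', 2), ('utt_c', 1)]
```

    A query only touches the posting lists for its own bigrams, so the lookup cost depends on the query length and the number of candidate matches rather than on a frame-by-frame comparison with every utterance.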

    In conclusion, BEST-STD marks a significant advance in spoken term detection. By addressing the limitations of traditional methods, it offers a robust and efficient solution for audio retrieval tasks. The combination of speaker-agnostic tokens and a bidirectional Mamba encoder improves performance and supports adaptability to diverse datasets. The framework shows promise for real-world applications, paving the way for improved accessibility and searchability of audio content.


    All credit for this research goes to the researchers of this project.

    The post This AI Paper Introduces BEST-STD (Spoken Term Detection): A Novel Bidirectional Mamba-Enhanced Speech Tokenization Framework for Efficient Spoken Term Detection appeared first on MarkTechPost.
