
    Microsoft and CMU Researchers Propose a Machine Learning Method to Train an AAC (Automated Audio Captioning) System Using Only Text

    April 12, 2024

    Automated Audio Captioning (AAC) is the task of translating audio streams into descriptive natural-language text. Building AAC systems depends on the availability of large, accurately annotated audio-text datasets. However, manually pairing audio segments with text captions is costly and labor-intensive, and it is prone to inconsistencies and biases, which restricts the scalability of AAC technologies.

    Existing research in AAC includes encoder-decoder architectures that use audio encoders such as PANN, AST, and HTSAT to extract audio features. These features are then passed to language generation components such as BART and GPT-2, which produce the caption. The CLAP model advances this by using contrastive learning to align audio and text in a shared multimodal embedding space. Techniques such as adversarial training and contrastive losses further refine AAC systems, improving caption diversity and accuracy while addressing the vocabulary limitations of earlier models.
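
To make the contrastive alignment idea concrete, here is a minimal pure-Python sketch of a symmetric contrastive loss over a batch of paired audio and text embeddings. It is an illustration of the general technique, not CLAP's actual implementation: the function names, the toy 2-dimensional embeddings, and the fixed temperature of 0.07 are all assumptions made for this example.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cross_entropy_diag(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        row_max = max(row)  # subtract the max for numerical stability
        log_sum_exp = row_max + math.log(sum(math.exp(x - row_max) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def symmetric_contrastive_loss(audio_embs, text_embs, temperature=0.07):
    """Average of audio-to-text and text-to-audio cross-entropy.

    The temperature value is an assumption chosen for illustration.
    """
    audio = [l2_normalize(a) for a in audio_embs]
    text = [l2_normalize(t) for t in text_embs]
    # logits[i][j] = scaled cosine similarity between audio i and text j
    logits = [[dot(a, t) / temperature for t in text] for a in audio]
    logits_transposed = [list(col) for col in zip(*logits)]
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(logits_transposed))


# Toy batch (hypothetical values): matched pairs point in similar directions.
audio_batch = [[1.0, 0.0], [0.0, 1.0]]
text_matched = [[0.9, 0.1], [0.1, 0.9]]
text_swapped = [[0.1, 0.9], [0.9, 0.1]]

loss_matched = symmetric_contrastive_loss(audio_batch, text_matched)
loss_swapped = symmetric_contrastive_loss(audio_batch, text_swapped)
```

With correctly paired captions the loss is close to zero, while swapping the captions makes it much larger. Minimizing this kind of loss is what pulls matching audio and text embeddings together in the shared space.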

    Microsoft and Carnegie Mellon University researchers have proposed a text-only training methodology for AAC systems built on the CLAP model. The approach removes the need for audio data during training by relying on text alone, a significant departure from the traditional AAC training process: the system learns to generate audio captions without ever seeing audio inputs during training.

    The method uses the CLAP framework to train AAC systems on text data alone. During training, captions are generated by a decoder conditioned on embeddings from a CLAP text encoder. At inference, the text encoder is replaced with a CLAP audio encoder so that the system can handle actual audio inputs. To bridge the modality gap between text and audio embeddings and keep performance robust, the researchers use a mix of Gaussian noise injection and a lightweight learnable adapter. The model is evaluated on two prominent datasets, AudioCaps and Clotho.
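
The train-time and inference-time conditioning described above can be sketched roughly as follows. This is a hedged illustration, not the researchers' code: `inject_gaussian_noise` and `conditioning_vector` are hypothetical helper names, the embeddings are hand-written stand-ins for real CLAP encoder outputs, and using 0.015 as the default noise variance is an assumption borrowed from the figure quoted in this article. The learnable adapter is omitted.

```python
import math
import random


def inject_gaussian_noise(embedding, variance, rng):
    """Add zero-mean Gaussian noise with the given variance to each dimension."""
    std = math.sqrt(variance)
    return [x + rng.gauss(0.0, std) for x in embedding]


def conditioning_vector(embedding, training, variance=0.015, rng=None):
    """Return the vector the caption decoder is conditioned on.

    training=True:  `embedding` is assumed to come from the text encoder and is
                    perturbed with noise to mimic the text-audio modality gap.
    training=False: `embedding` is assumed to come from the audio encoder and
                    is passed through unchanged.
    The default variance is an illustrative assumption.
    """
    if not training:
        return list(embedding)
    if rng is None:
        rng = random.Random()
    return inject_gaussian_noise(embedding, variance, rng)


# Hypothetical stand-ins for encoder outputs.
text_embedding = [0.6, 0.8, 0.0]
audio_embedding = [0.5, 0.7, 0.1]

rng = random.Random(0)  # seeded only so the sketch is reproducible
train_vec = conditioning_vector(text_embedding, training=True, rng=rng)
infer_vec = conditioning_vector(audio_embedding, training=False)
no_noise_vec = conditioning_vector(text_embedding, training=True, variance=0.0, rng=rng)
```

The intent of the noise is that the decoder never sees perfectly clean text embeddings, so it is less sensitive to the small offset between text and audio embeddings when the encoder is swapped at inference.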

    In evaluation, the text-only model achieved a SPIDEr score of 0.456 on the AudioCaps dataset and 0.255 on the Clotho dataset, which is competitive with state-of-the-art AAC systems trained on paired audio-text data. Moreover, with Gaussian noise injection and the learnable adapter, the model bridged the modality gap effectively, evidenced by the minimization of the variance in embeddings to approximately 0.015. These quantitative results support the effectiveness of the proposed text-only training approach in generating accurate and relevant audio captions.
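
For readers unfamiliar with the metric, SPIDEr is conventionally defined as the arithmetic mean of the CIDEr and SPICE captioning scores. The snippet below is a small sketch of that combination; the function name is hypothetical and the input values are arbitrary placeholders, not numbers from the paper.

```python
def spider_score(cider, spice):
    """Combine CIDEr and SPICE into SPIDEr as their arithmetic mean."""
    return (cider + spice) / 2.0


# Placeholder inputs chosen only to show the arithmetic.
example_spider = spider_score(cider=0.5, spice=0.25)  # (0.5 + 0.25) / 2 = 0.375
```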

    To conclude, the research presents a text-only training method for AAC based on the CLAP model, eliminating the dependency on paired audio-text data during training. Its viability is demonstrated by competitive SPIDEr scores on the AudioCaps and Clotho datasets. The approach simplifies AAC system development, improves scalability, and reduces reliance on costly data annotation. Such advances in AAC training could significantly broaden the application and accessibility of audio captioning technologies.

    Check out the Paper. All credit for this research goes to the researchers of this project.

    The post Microsoft and CMU Researchers Propose a Machine Learning Method to Train an AAC (Automated Audio Captioning) System Using Only Text appeared first on MarkTechPost.
