    SpeechVerse: A Multimodal AI Framework that Enables LLMs to Follow Natural Language Instructions for Performing Diverse Speech-Processing Tasks

    May 18, 2024

    Large language models (LLMs) excel at natural language tasks and instruction following, yet they struggle with non-textual data such as images and audio. Adding speech comprehension could substantially improve human-computer interaction. Current methods typically run automatic speech recognition (ASR) first and pass the transcript to an LLM, which discards the non-textual cues carried by the audio. A promising alternative integrates a textual LLM with a speech encoder in a single training setup, so the model can draw on both speech and text rather than on a transcript alone. Instruction-following multimodal audio-language models in particular are gaining traction because they generalize across tasks. Previous works such as SpeechT5, Whisper, VIOLA, SpeechGPT, and SLM show promise, but they are constrained to a limited range of speech tasks.

    Multi-task learning uses representations shared across diverse tasks to improve generalization and efficiency. Models like T5 and SpeechNet apply this approach to text and speech tasks. Multimodal large language models that integrate audio, however, have received less attention. Recent efforts such as SpeechGPT and Qwen-Audio aim to bridge this gap and demonstrate capabilities across various audio tasks. SpeechVerse combines multi-task learning with instruction finetuning to improve performance on audio-text tasks.

    Amazon researchers introduce SpeechVerse, a multi-task framework with supervised instruction finetuning for diverse speech tasks. Unlike SpeechGPT, it utilizes continuous representations from pre-trained speech models for text-only output tasks. In comparison to Qwen-Audio, which requires hierarchical tagging and a large-scale audio encoder, SpeechVerse incorporates multi-task learning and finetuning without task-specific tagging, enabling generalization to unseen tasks through natural language instructions.

    The SpeechVerse architecture comprises three parts: an audio encoder, a convolution downsampling module, and an LLM. The audio encoder, a pre-trained model, extracts semantic features from the audio and produces a unified representation. The downsampling module adjusts those audio features so they are compatible with LLM token sequences. The LLM then processes text and audio together, combining the downsampled audio features with the token embeddings. Training uses curriculum learning with parameter-efficient finetuning and keeps the pre-trained components frozen, so the model can be adapted to diverse speech tasks efficiently.
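The data flow described above can be sketched in a few lines of plain Python. This is only an illustration of how sequence lengths change, not the authors' implementation. The function names `downsample` and `build_llm_input`, the fixed smoothing kernel, the stride of 2, and the choice to place audio positions before the instruction embeddings are all assumptions made for the example. A real downsampling module would use learned convolution weights and would also project the features to the LLM's embedding size, which is assumed here to already match.

```python
def downsample(frames, kernel, stride):
    """Strided 1-D convolution over time (hypothetical helper).

    Each output frame is a weighted sum of len(kernel) consecutive input
    frames. The same scalar kernel is applied to every feature dimension,
    which is a simplification of a learned convolution.
    """
    dim = len(frames[0])
    out = []
    for start in range(0, len(frames) - len(kernel) + 1, stride):
        window = frames[start:start + len(kernel)]
        out.append([
            sum(w * frame[d] for w, frame in zip(kernel, window))
            for d in range(dim)
        ])
    return out


def build_llm_input(audio_frames, instruction_embeddings,
                    kernel=(0.25, 0.5, 0.25), stride=2):
    """Shorten the audio sequence, then join it with the text embeddings.

    Assumption: audio positions come first, followed by the embedded
    instruction tokens. The source does not specify the ordering.
    """
    audio_positions = downsample(audio_frames, kernel, stride)
    return audio_positions + instruction_embeddings


# Stand-in data: 10 encoder frames and 3 instruction tokens, all 4-dimensional.
audio = [[1.0] * 4 for _ in range(10)]
instruction = [[0.0] * 4 for _ in range(3)]

sequence = build_llm_input(audio, instruction)
print(len(sequence))  # 4 audio positions + 3 text positions = 7
```

The point of the sketch is the length reduction: encoder outputs usually contain many more frames than a transcript has tokens, so shortening them before they reach the LLM keeps the combined sequence manageable.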

    The evaluation of end-to-end trained joint speech and language models (E2E-SLM) built with the SpeechVerse framework covers 11 tasks spanning various domains and datasets. ASR benchmarks indicate that SpeechVerse's core speech understanding is effective, with task-specific pre-trained ASR models showing promising results. On spoken language understanding (SLU) tasks, end-to-end trained models outperform cascaded pipelines in most cases. SpeechVerse models also perform competitively with, or better than, state-of-the-art models on tasks such as automatic speech recognition (ASR), speech translation (ST), intent classification (IC), slot filling (SF), and emotion recognition (ER).

    To recap, Amazon researchers introduce SpeechVerse, a multimodal framework that enables LLMs to perform diverse speech-processing tasks through natural language instructions. By using supervised instruction finetuning and combining representations from pre-trained speech and text models, SpeechVerse shows strong zero-shot generalization on unseen tasks. In comparisons against conventional baselines it performs better on 9 of the 11 tasks, which reflects robust instruction-following capability. The model also holds up on out-of-domain datasets, unseen prompts, and novel tasks, suggesting that the proposed training approach fosters generalizability.

    Check out the Paper. All credit for this research goes to the researchers of this project.

    The post SpeechVerse: A Multimodal AI Framework that Enables LLMs to Follow Natural Language Instructions for Performing Diverse Speech-Processing Tasks appeared first on MarkTechPost.
