
    LLMs Can Now Talk in Real-Time with Minimal Latency: Chinese Researchers Release LLaMA-Omni2, a Scalable Modular Speech Language Model

    May 6, 2025

    Researchers at the Institute of Computing Technology, Chinese Academy of Sciences, have introduced LLaMA-Omni2, a family of speech-capable large language models (SpeechLMs) now available on Hugging Face. The work presents a modular framework that enables real-time spoken dialogue by integrating speech perception and synthesis with language understanding. Unlike earlier cascaded systems, LLaMA-Omni2 operates as an end-to-end pipeline while retaining modular interpretability and low training cost.

    Overview of the LLaMA-Omni2 Architecture

    LLaMA-Omni2 encompasses models ranging from 0.5B to 14B parameters, each built atop the Qwen2.5-Instruct series. The architecture consists of:

    • Speech Encoder: Utilizes Whisper-large-v3 to transform input speech into token-level acoustic representations.
    • Speech Adapter: Processes encoder outputs using a downsampling layer and a feed-forward network to align with the language model’s input space.
    • Core LLM: The Qwen2.5 models serve as the main reasoning engine.
    • Streaming TTS Decoder: Converts LLM outputs into speech tokens using an autoregressive Transformer and then generates mel spectrograms through a causal flow matching model inspired by CosyVoice2.

    A gating mechanism fuses LLM hidden states with textual embeddings before speech synthesis, enhancing contextual fidelity in the generated audio.
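
    To make the fusion step concrete, here is a minimal PyTorch sketch of one plausible form of such a gate; the exact parameterization in LLaMA-Omni2 may differ, and the class name is illustrative.

```python
import torch
import torch.nn as nn

class GateFusion(nn.Module):
    """Illustrative gated fusion of LLM hidden states and text embeddings.

    A learned sigmoid gate decides, per dimension, how much of the LLM
    hidden state vs. the textual embedding to pass to the TTS decoder.
    """

    def __init__(self, dim: int):
        super().__init__()
        self.gate_proj = nn.Linear(2 * dim, dim)

    def forward(self, hidden: torch.Tensor, text_emb: torch.Tensor) -> torch.Tensor:
        # hidden, text_emb: (batch, seq_len, dim)
        gate = torch.sigmoid(self.gate_proj(torch.cat([hidden, text_emb], dim=-1)))
        return gate * hidden + (1.0 - gate) * text_emb
```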

    Streaming Generation with Read-Write Scheduling

    The model adopts a read-write strategy to facilitate streaming output. Specifically, for every R tokens produced by the LLM, W speech tokens are generated. This enables synchronized textual and acoustic generation, minimizing latency without compromising fluency.

    Empirical findings suggest that setting R = 3 and W = 10 provides a favorable trade-off between latency (~583 ms), alignment (ASR-WER: 3.26), and perceptual quality (UTMOS: 4.19).
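
    As a toy illustration of this schedule (using the R = 3, W = 10 setting above), the Python sketch below interleaves text and speech tokens; speech_token_fn is a hypothetical stand-in for the autoregressive TTS decoder, not the paper's API.

```python
def read_write_stream(llm_tokens, speech_token_fn, R=3, W=10):
    """Toy read-write scheduler: after every R text tokens "read" from
    the LLM, "write" W speech tokens conditioned on the text so far."""
    text_so_far = []
    for i, tok in enumerate(llm_tokens, start=1):
        text_so_far.append(tok)
        yield ("text", tok)
        if i % R == 0:  # R tokens read; emit W speech tokens
            for s in speech_token_fn(text_so_far, W):
                yield ("speech", s)

# Dummy stand-in for the TTS decoder, for demonstration only.
dummy_tts = lambda prefix, w: [f"spk_{len(prefix)}_{k}" for k in range(w)]
events = list(read_write_stream(list("hello world"), dummy_tts))
```

    In the real system each "write" chunk can be vocoded and played back immediately, which is how the pipeline keeps response latency low.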

    Training Approach

    Despite achieving competitive performance, LLaMA-Omni2 is trained on a relatively compact corpus of 200K multi-turn speech-to-speech dialogue samples. These samples are synthesized from instruction-following text datasets (Alpaca, UltraChat), with diverse input voices and a consistent output voice generated using FishSpeech and CosyVoice2 models.
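
    A sketch of that synthesis recipe, with tts_user and tts_assistant as hypothetical stand-ins for FishSpeech- and CosyVoice2-style synthesizers (an illustration of the idea, not the authors' pipeline):

```python
import random

def synthesize_dialogue(turns, user_voices, tts_user, tts_assistant):
    """Convert a multi-turn text dialogue into a speech-to-speech sample:
    user turns get a randomly drawn voice, assistant turns a fixed voice."""
    voice = random.choice(user_voices)  # diverse input voices across samples
    audio_turns = []
    for role, text in turns:
        if role == "user":
            audio_turns.append((role, tts_user(text, voice=voice)))
        else:
            audio_turns.append((role, tts_assistant(text)))  # consistent output voice
    return audio_turns
```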

    Training is executed in two stages, sketched in code below:

    • Stage I: Independently optimizes the speech-to-text and text-to-speech modules.
    • Stage II: Fine-tunes the speech-to-speech generation path, including the gating and autoregressive decoding components.
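
    One way this staging might be organized in code: the sketch below assumes hypothetical attribute names (llm, speech_adapter, tts_decoder, gate_fusion), and freezing the LLM backbone in Stage I is an assumption rather than a detail stated in the article.

```python
# Illustrative two-stage schedule; module names are hypothetical.
def set_trainable(module, flag: bool):
    for p in module.parameters():
        p.requires_grad = flag

def configure_stage_one(model):
    # Stage I: optimize the speech-to-text path (speech adapter) and the
    # text-to-speech path (TTS decoder) independently.
    set_trainable(model.llm, False)  # assumed frozen, not confirmed
    set_trainable(model.speech_adapter, True)
    set_trainable(model.tts_decoder, True)

def configure_stage_two(model):
    # Stage II: fine-tune the speech-to-speech generation path, including
    # the gate fusion and autoregressive decoding components.
    set_trainable(model.gate_fusion, True)
    set_trainable(model.tts_decoder, True)
```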

    Benchmark Results

    The models are evaluated on spoken question answering and speech instruction following tasks using both speech-to-text (S2T) and speech-to-speech (S2S) modes.

    Model              Llama Q (S2S)   Web Q (S2S)   GPT-4o Score   ASR-WER   Latency (ms)
    GLM-4-Voice (9B)   50.7            15.9          4.09           3.48      1562.8
    LLaMA-Omni (8B)    49.0            23.7          3.52           3.67      346.7
    LLaMA-Omni2-7B     60.7            31.3          4.15           3.26      582.9

    Performance scales consistently with model size. Notably, LLaMA-Omni2-14B outperforms all baselines across tasks despite being trained on substantially less data than native SpeechLMs such as GLM-4-Voice.

    Component Analyses

    • Gate Fusion Module: Removing the gating mechanism increases ASR-WER and reduces speech quality, confirming its role in aligning textual and contextual signals.
    • TTS Pretraining: Initializing the TTS model from Qwen2.5 and fine-tuning in a streaming setup yields the best performance. Training from scratch fails to converge effectively.
    • Read/Write Strategies: Adjusting the R:W ratio impacts latency and quality. Larger W improves UTMOS but at the cost of response delay.

    Additionally, the study demonstrates that multi-turn dialogue data is more effective than single-turn data in training speech interaction capabilities, and that performance plateaus around 200K samples.

    Conclusion

    LLaMA-Omni2 demonstrates that high-quality, low-latency spoken interaction with LLMs is feasible without the need for extensive pretraining on massive speech corpora. By combining modular architecture with autoregressive streaming synthesis, the system offers a practical pathway for real-time speech applications.


    Check out the Paper, Model on Hugging Face, and GitHub Page.
