
    Listening-While-Speaking Language Model (LSLM): An End-to-End System Equipped with both Listening and Speaking Channels

    August 7, 2024

    In the realm of human-computer interaction (HCI), dialogue stands out as the most natural form of communication. The advent of speech language models (SLMs) has significantly enhanced speech-based conversational AI, yet these models remain constrained to turn-based interactions, limiting their applicability in real-time scenarios. This gap in real-time interaction presents a significant challenge, particularly in situations requiring immediate feedback and dynamic conversational flow. The inability to handle interruptions and maintain seamless interaction has spurred researchers to explore full duplex modeling (FDM) in interactive speech language models (iSLM). Addressing this challenge, the research introduces the Listening-while-Speaking Language Model (LSLM), an innovative design to enable real-time, uninterrupted interaction by integrating listening and speaking capabilities within a single system.

Current speech language models typically rely on turn-based systems, in which listening and speaking occur in isolated phases. These systems often chain separate automatic speech recognition (ASR) and text-to-speech (TTS) modules, which introduces latency and makes real-time interruptions impossible to handle. Notable models such as SpeechGPT and LauraGPT have advanced conversational AI, yet they remain bound to this turn-based paradigm and cannot provide the fluid interaction required for natural human-computer dialogue.
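
To make the latency problem concrete, here is a minimal, runnable sketch of the turn-based pipeline described above. The three stages are stubs standing in for real ASR, language-model, and TTS components; the names and behavior are illustrative, not APIs from any particular library.

```python
# Toy turn-based dialogue pipeline: each stage must finish before the
# next begins, so the system is deaf while it thinks and speaks.

def asr(audio: bytes) -> str:
    return "hello"                       # stub: transcribe a complete utterance

def generate_reply(text: str) -> str:
    return f"you said: {text}"           # stub: language-model response

def tts(text: str) -> bytes:
    return text.encode()                 # stub: synthesize the full reply

def turn_based_dialogue(audio_in: bytes) -> bytes:
    text_in = asr(audio_in)              # 1) listening ends before anything else starts
    text_out = generate_reply(text_in)   # 2) think
    audio_out = tts(text_out)            # 3) speak; the system cannot listen during
    return audio_out                     #    steps 2-3, so interruptions are lost

print(turn_based_dialogue(b"..."))
```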

To overcome these limitations, a team of researchers from Shanghai Jiao Tong University and ByteDance proposes the LSLM, an end-to-end system designed to listen and speak simultaneously. The model employs a token-based decoder-only TTS for speech generation and a streaming self-supervised learning (SSL) encoder for real-time audio input. The LSLM's distinctive contribution lies in fusing these two channels, enabling it to detect turn-taking in real time and respond dynamically. After exploring three fusion strategies (early fusion, middle fusion, and late fusion), the researchers identified middle fusion as the best balance between speech generation quality and real-time interaction.

The LSLM's architecture revolves around its dual-channel design. For speaking, the model uses an autoregressive token-based TTS system. Unlike previous models that combine autoregressive and non-autoregressive approaches, the LSLM simplifies the pipeline by operating directly on discrete audio tokens, which improves real-time interaction and removes the need for extensive processing before speech synthesis. The speaking channel generates speech tokens conditioned on the given context, and a vocoder then converts these tokens into audible speech. This setup lets the model focus on semantic information, improving the clarity and relevance of its responses.
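
A hedged sketch of such an autoregressive token-based speaking channel is shown below. The vocabulary size, layer counts, and greedy decoding loop are illustrative assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn

VOCAB = 1024   # size of the discrete audio-token codebook (assumed)
EOS = 0        # end-of-utterance token id (assumed)

class SpeakingChannel(nn.Module):
    """Decoder-only Transformer emitting discrete audio tokens
    (positional encodings omitted for brevity)."""
    def __init__(self, d_model=256, n_layers=4, n_heads=4):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, VOCAB)

    def forward(self, tokens):                        # tokens: (B, T) long
        x = self.embed(tokens)
        causal = nn.Transformer.generate_square_subsequent_mask(x.size(1))
        h = self.blocks(x, mask=causal)               # causal self-attention
        return self.head(h)                           # next-token logits

@torch.no_grad()
def generate(model, context, max_steps=50):
    """Extend `context` one audio token at a time, autoregressively."""
    tokens = context
    for _ in range(max_steps):
        logits = model(tokens)[:, -1]                 # logits for the next step
        nxt = logits.argmax(-1, keepdim=True)         # greedy pick, for the sketch
        tokens = torch.cat([tokens, nxt], dim=1)
        if (nxt == EOS).all():                        # stop at end of utterance
            break
    return tokens  # a neural vocoder would map these tokens to a waveform

model = SpeakingChannel()
print(generate(model, torch.randint(1, VOCAB, (1, 8))).shape)
```

In practice, temperature sampling typically replaces the greedy argmax, and the emitted token sequence is handed to the vocoder for waveform synthesis.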

On the listening side, the model employs a streaming SSL encoder that processes incoming audio continuously. This encoder converts the audio input into continuous embeddings, which are then projected into a space that can be processed alongside the speaking tokens. The two channels are integrated through one of three fusion methods, with middle fusion emerging as the most effective: the listening and speaking channels are merged at each Transformer block, allowing the model to exploit both channels' information throughout the speech generation process. This fusion strategy ensures that the LSLM can handle interruptions smoothly and maintain a coherent, responsive dialogue flow.
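
The following sketch illustrates the middle-fusion idea, with the projected listening embedding injected into the speaking stream at every Transformer block. The fusion operator (a simple projected sum), the assumption that the two streams are time-aligned, and all dimensions are choices made for illustration; causal masking is omitted for brevity.

```python
import torch
import torch.nn as nn

class FusedBlock(nn.Module):
    """One Transformer block with the listening embedding fused in."""
    def __init__(self, d_model=256, n_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                nn.Linear(4 * d_model, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.listen_proj = nn.Linear(d_model, d_model)  # maps listening features

    def forward(self, spk, listen):
        x = spk + self.listen_proj(listen)            # fuse the two channels
        a, _ = self.attn(x, x, x, need_weights=False)
        x = self.norm1(x + a)
        return self.norm2(x + self.ff(x))

class MiddleFusionDecoder(nn.Module):
    def __init__(self, n_layers=4, d_model=256):
        super().__init__()
        self.blocks = nn.ModuleList(FusedBlock(d_model) for _ in range(n_layers))

    def forward(self, spk, listen):                   # both: (B, T, d_model)
        for blk in self.blocks:                       # listening information
            spk = blk(spk, listen)                    # enters at every layer
        return spk

dec = MiddleFusionDecoder()
print(dec(torch.randn(1, 10, 256), torch.randn(1, 10, 256)).shape)
```

Early fusion would instead add the listening embedding only at the input, and late fusion only at the output logits; injecting it at every block is what lets listening influence generation throughout.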

The LSLM was evaluated under two experimental settings: command-based FDM and voice-based FDM. In the command-based scenario, the model was tested on its ability to respond to specific commands amid background noise; the voice-based scenario evaluated its sensitivity to interruptions from various speakers. The results demonstrated the LSLM's robustness in noisy environments and its ability to recognize and adapt to new voices and instructions. The middle fusion strategy, in particular, balanced the demands of real-time interaction and speech generation, providing a seamless user experience.
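
As a hedged sketch of what such a full-duplex loop exercises, the toy simulation below interleaves speaking steps with listening-channel checks and stops generation when an interruption is detected. The detector, threshold, and token stubs are all illustrative stand-ins, not the paper's mechanism.

```python
import random

def listen_encode(frame: bytes) -> float:
    return random.random()            # stub: streaming "interruption score"

def next_speech_token(history: list) -> int:
    return len(history) % 1024        # stub: autoregressive speaking channel

def full_duplex(mic_frames, threshold=0.95):
    spoken = []
    for frame in mic_frames:                  # one audio frame per token step
        score = listen_encode(frame)          # listening runs while speaking
        if score > threshold:                 # real-time turn-taking detected
            print(f"interrupted after {len(spoken)} tokens")
            break
        spoken.append(next_speech_token(spoken))
    return spoken

full_duplex([b"frame"] * 100)
```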

    The Listening-while-Speaking Language Model (LSLM) represents a significant leap forward in interactive speech-language models. By addressing the limitations of turn-based systems and introducing a robust, real-time interaction capability, the LSLM paves the way for more natural and fluid human-computer dialogues. The research highlights the importance of integrating full duplex capabilities into SLMs, showcasing how such advancements can enhance the applicability of conversational AI in real-world scenarios. Through its innovative design and impressive performance, the LSLM sets a new standard for future developments in speech-based HCI.

Check out the Paper. All credit for this research goes to the researchers of this project.
