
    Advancing Scalable Text-to-Speech Synthesis: Llasa’s Transformer-Based Framework for Improved Speech Quality and Emotional Expressiveness

    February 11, 2025

Recent advancements in LLMs, such as the GPT series and emerging “o1” models, highlight the benefits of scaling both training and inference-time compute. While scaling during training—by increasing model size and dataset volume—is a well-established strategy, recent findings emphasize the advantages of inference-time scaling, where additional computation at test time improves output quality and the handling of more complex tasks. This principle has been widely explored in text-based models but remains underutilized in speech synthesis. Existing text-to-speech (TTS) systems often employ multi-stage architectures that combine LLMs with diffusion models or other processing modules, which complicates scaling decisions. Unlike text models, which follow a standardized Transformer framework that allows systematic scaling investigations, TTS research has largely focused on architectural improvements rather than on optimizing inference-time computation.

    A shift toward single-stage TTS architectures addresses the inefficiencies of multi-stage pipelines by directly modeling discrete speech tokens instead of relying on intermediate acoustic representations. This approach reduces complexity, enhances scalability, and enables large-scale training without significant memory constraints. Evaluations of such architectures demonstrate state-of-the-art performance in zero-shot speech synthesis, cross-lingual adaptation, and emotion preservation, surpassing traditional multi-stage models. Additionally, integrating scaling strategies improves ASR accuracy, bridging the gap between text- and speech-based LLM applications. By adopting a unified, compute-efficient framework, recent advancements in TTS align more closely with the scalable methodologies seen in text LLMs, enabling more flexible and high-quality speech synthesis solutions.

Researchers from the ASLP Lab at Northwestern Polytechnical University, the University of Science and Technology Beijing, the University of Surrey, the Chinese University of Hong Kong, Hong Kong Baptist University, the University of Rochester, and Shanghai Mobvoi Information Technology introduce Llasa, a Transformer-based TTS model aligned with standard LLM architectures. Scaling train-time compute improves speech naturalness and prosody, while scaling inference-time compute with speech-understanding verifiers enhances emotional expressiveness, timbre consistency, and content accuracy. Evaluations on several datasets show state-of-the-art results, and the model and code are publicly available to encourage further TTS research.
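The verifier-guided inference-time scaling described above can be sketched, under assumptions, as best-of-N reranking: sample several candidate speech outputs and keep the one a speech-understanding verifier scores highest. The generator and verifier below are toy stand-ins, not Llasa's actual models or search procedure.

```python
import random

def best_of_n(generate, verifier_score, text, n=8, seed=0):
    """Generic best-of-N reranking: draw n candidates and return the one
    the verifier scores highest, along with its score."""
    rng = random.Random(seed)
    best, best_score = None, float("-inf")
    for _ in range(n):
        candidate = generate(text, rng)
        score = verifier_score(text, candidate)
        if score > best_score:
            best, best_score = candidate, score
    return best, best_score

# Toy stand-ins: "generation" draws a random number; the "verifier"
# prefers values close to 0.5. A real verifier would score transcription
# accuracy, speaker similarity, or emotional fit of synthesized speech.
gen = lambda text, rng: rng.random()
score = lambda text, c: -abs(c - 0.5)
cand, s = best_of_n(gen, score, "hello", n=16)
```

Spending more compute here means raising `n`: more candidates give the verifier more chances to find a high-quality sample, which is the essence of inference-time scaling.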

    The TTS framework aligns with the standard text LLM paradigm, using a tokenizer and a Transformer-based LLM. It employs Xcodec2, a speech tokenizer that encodes waveforms into discrete tokens and decodes them into high-quality audio. The model learns the joint distribution of text and speech tokens, optimizing the conditional probability of generating speech tokens based on text input. The speech tokenizer integrates semantic and acoustic features using a dual encoder system. The approach scales training data and model size to improve performance and evaluates train-time and inference-time compute strategies, focusing on text understanding and in-context learning capabilities.
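To make the single-stage formulation concrete, here is a minimal sketch of how text tokens and discrete speech tokens can be concatenated into one causal-LM training sequence. The token IDs and the boundary token are illustrative, not Llasa's actual vocabulary or Xcodec2 codes.

```python
def build_sequence(text_ids, speech_ids, boundary_id):
    """Concatenate text tokens and discrete speech tokens into a single
    sequence, separated by a boundary token, so one Transformer can model
    the joint distribution with next-token prediction."""
    return text_ids + [boundary_id] + speech_ids

def training_targets(sequence):
    """Standard causal-LM shift: predict token t+1 from tokens up to t.
    Speech tokens are thus generated conditioned on the text prefix."""
    return sequence[:-1], sequence[1:]

# Toy IDs: text -> [5, 9], speech codec tokens -> [101, 102, 103]
seq = build_sequence([5, 9], [101, 102, 103], boundary_id=0)
inputs, targets = training_targets(seq)
print(inputs)   # [5, 9, 0, 101, 102]
print(targets)  # [9, 0, 101, 102, 103]
```

Because the whole pipeline reduces to next-token prediction over one vocabulary, the same scaling recipes used for text LLMs apply directly.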

    The study compares the proposed speech tokenizer with existing codecs and evaluates its performance in TTS systems. The speech tokenizer is tested against various models using metrics such as Word Error Rate (WER), Perceptual Evaluation of Speech Quality (PESQ), and speaker similarity (SPK SIM). Results show that the tokenizer performs well at low token rates, achieving better speech quality than other codecs. The TTS models, evaluated for their text understanding and in-context learning abilities, improve with scaling model size and training data. Inference-time compute scaling also enhances performance, balancing speaker similarity and transcription accuracy.
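Of the metrics mentioned, WER is simple to sketch: word-level edit distance normalized by the reference length. The implementation below is a minimal illustration; the paper's exact evaluation tooling may differ.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + insertions + deletions) / reference length,
    computed via standard edit distance over words. Assumes a non-empty
    reference."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # substitution/match
    return dp[-1][-1] / len(ref)

print(word_error_rate("the cat sat", "the cat sat"))  # 0.0
print(word_error_rate("the cat sat", "the hat sat"))  # one substitution over three words -> 1/3
```

Lower WER on transcriptions of reconstructed audio indicates the tokenizer preserves intelligible content even at low token rates.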

    In conclusion, the study introduces Llasa, a scalable TTS system that uses a single Transformer model and tokenizer, aligning with text-based LLMs. The study explores train-time and inference-time compute scaling, showing that larger models and datasets improve speech naturalness, prosody, and comprehension. Additionally, using speech understanding models as verifiers, inference-time scaling enhances speaker similarity, emotional expressiveness, and accuracy. Llasa’s experiments demonstrate state-of-the-art performance with strong zero-shot TTS capabilities. The authors release their models and training codes to encourage further research in the field.


Check out the Paper and GitHub Page. All credit for this research goes to the researchers of this project.


    The post Advancing Scalable Text-to-Speech Synthesis: Llasa’s Transformer-Based Framework for Improved Speech Quality and Emotional Expressiveness appeared first on MarkTechPost.

