
    VoiceCraft: A Transformer-based Neural Codec Language Model (NCLM) that Achieves State-of-the-Art Performance on Speech Editing and Zero-Shot TTS

    April 8, 2024

    When textless natural language processing (NLP) first emerged, its central idea was to train a language model on sequences of learnable, discrete units rather than on transcribed text, so that NLP tasks could be applied directly to spoken utterances. In speech editing, a model must modify individual words or phrases to match a target transcript while leaving the rest of the original speech unaltered. Researchers are now exploring a unified model for both zero-shot text-to-speech (TTS) and speech editing, a significant step forward for the field.

    Recent research from the University of Texas at Austin and Rembrand presents VOICECRAFT, a Transformer-based NCLM that generates neural speech codec tokens for infilling, using autoregressive conditioning on bidirectional context. VOICECRAFT achieves state-of-the-art (SotA) results in zero-shot TTS and speech editing. The approach rests on a two-stage token rearrangement process consisting of a causal masking step and a delayed stacking step. Causal masking, inspired by the success of the causal masked multimodal model in joint text-image modeling, enables autoregressive generation with bidirectional context and is applied here to speech codec sequences.
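
A minimal sketch of the two rearrangement steps named above may help make them concrete. It is a simplified illustration, not the authors' implementation: the function names, the `<M0>`-style mask tokens, and the `_` placeholder are all hypothetical, spans are assumed not to overlap, and the delay step is shown on a small stand-alone codebook matrix.

```python
from typing import List, Sequence, Tuple

EMPTY = "_"  # hypothetical placeholder for positions left vacant by the delay shift


def causal_mask_rearrange(frames: Sequence, spans: List[Tuple[int, int]]) -> list:
    """Move each masked span [start, end) to the end of the sequence.

    Each span is replaced in place by a mask token "<M{i}>", and the span's
    frames are appended at the end after the same mask token. A left-to-right
    model therefore sees the context on both sides before it fills the span.
    Spans are assumed to be non-overlapping.
    """
    out, tail, cursor = [], [], 0
    for i, (start, end) in enumerate(sorted(spans)):
        out.extend(frames[cursor:start])  # unmasked context before the span
        out.append(f"<M{i}>")             # marks where the span was removed
        tail.append(f"<M{i}>")            # announces which span is being filled
        tail.extend(frames[start:end])    # the span itself, now at the end
        cursor = end
    out.extend(frames[cursor:])           # remaining unmasked context
    return out + tail


def delay_stack(codes: List[list]) -> List[list]:
    """Shift codebook k right by k steps, padding with EMPTY on both sides.

    After the shift, column t holds codebook k of frame t - k, so the model
    predicts the codebooks of one frame over successive steps.
    """
    k_total = len(codes)
    return [
        [EMPTY] * k + list(row) + [EMPTY] * (k_total - 1 - k)
        for k, row in enumerate(codes)
    ]


rearranged = causal_mask_rearrange(["y0", "y1", "y2", "y3", "y4", "y5"], [(1, 3)])
delayed = delay_stack([["a0", "a1", "a2"], ["b0", "b1", "b2"]])
```

In this toy run, `rearranged` places `y1` and `y2` after the trailing `<M0>`, and `delayed` offsets the second codebook row by one step relative to the first.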

    To further ensure effective multi-codebook modeling, the team combines causal masking with delayed stacking as its token rearrangement procedure. To evaluate speech editing, the team built a realistic and challenging dataset called REALEDIT. It contains 310 real-world speech editing examples collected from audiobooks, YouTube videos, and Spotify podcasts, with waveforms ranging from 5 to 12 seconds in duration. The target transcripts are produced by editing the source transcripts while keeping them grammatically correct and semantically coherent.

    The dataset covers many editing scenarios, including insertion, deletion, substitution, and editing multiple spans at once, with edited text ranging from one to sixteen words. Because the recordings vary in subject matter, accent, speaking style, recording environment, and background noise, REALEDIT is more challenging than popular speech synthesis evaluation datasets such as VCTK, LJSpeech, and LibriTTS, which offer audiobooks. Its diversity and realism make REALEDIT a good barometer for the real-world applicability of speech editing models.
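
To make the editing scenarios above concrete, here is a small sketch that compares a source transcript with a target transcript at the word level and reports which spans were inserted, deleted, or replaced. It is only an illustration built on Python's standard `difflib`. The function name `edit_spans` and the example sentences are hypothetical, and nothing here reflects how REALEDIT itself was constructed.

```python
from difflib import SequenceMatcher
from typing import List, Tuple


def edit_spans(source: str, target: str) -> List[Tuple[str, List[str], List[str]]]:
    """Return (operation, source_words, target_words) for each edited span.

    The operation is one of "insert", "delete", or "replace", as reported by
    difflib. More than one returned span corresponds to a multi-span edit.
    """
    src, tgt = source.split(), target.split()
    matcher = SequenceMatcher(a=src, b=tgt, autojunk=False)
    return [
        (tag, src[i1:i2], tgt[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"  # keep only the spans that actually changed
    ]


substitution = edit_spans("the cat sat on the mat", "the dog sat on the mat")
insertion = edit_spans("we met today", "we met again today")
```

Here `substitution` contains a single `replace` span swapping `cat` for `dog`, and `insertion` contains a single `insert` span adding `again`.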

    On REALEDIT, VOICECRAFT performs far better than the previous SotA speech editing model in subjective human listening tests. Most importantly, VOICECRAFT’s edited speech sounds almost identical to the original, unaltered audio. On zero-shot TTS, VOICECRAFT outperforms strong baselines, including a reproduced VALL-E and the well-known commercial model XTTS v2, without requiring fine-tuning. The team used audiobooks and YouTube videos in their dataset.

    Despite VOICECRAFT’s progress, the team highlights some limitations, such as:

    The most notable failure during generation is long stretches of silence followed by a scratching sound. In this study the team worked around it by sampling several utterances and selecting the shorter ones, but more refined and efficient approaches should exist.

    Another critical issue for AI safety is how to watermark and detect synthetic speech. Watermarking and deepfake detection have received considerable attention recently, and substantial progress has been made.

    However, with the advent of more sophisticated models like VOICECRAFT, the team believes that safety researchers face new opportunities and hurdles. They have made all of their code and model weights publicly available to help with research into AI safety and speech synthesis. 
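
The workaround for long silences noted among the limitations above, sampling several utterances and keeping the shorter ones, can be sketched roughly as follows. This is an assumption-laden illustration: `pick_shortest` and `toy_generate` are hypothetical names, and the toy generator simply returns token lists of preset lengths in place of a real model call.

```python
from typing import Callable, Sequence


def pick_shortest(generate: Callable[[], Sequence[int]], num_samples: int = 4) -> Sequence[int]:
    """Draw num_samples candidates and keep the one with the fewest tokens.

    Shorter outputs are less likely to contain a long silent stretch, which
    is the heuristic described in the text.
    """
    if num_samples < 1:
        raise ValueError("num_samples must be at least 1")
    candidates = [generate() for _ in range(num_samples)]
    return min(candidates, key=len)


# Toy stand-in for a model: each call returns a token list of a preset length.
lengths = iter([120, 80, 200])


def toy_generate() -> Sequence[int]:
    return [0] * next(lengths)


best = pick_shortest(toy_generate, num_samples=3)
print(len(best))  # 80
```

Selecting by length alone is crude, which matches the article's remark that more refined methods should exist.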

    Check out the Paper and GitHub. All credit for this research goes to the researchers of this project.


    The post VoiceCraft: A Transformer-based Neural Codec Language Model (NCLM) that Achieves State-of-the-Art Performance on Speech Editing and Zero-Shot TTS appeared first on MarkTechPost.
