VoiceCraft: A Transformer-based Neural Codec Language Model (NCLM) that Achieves State-of-the-Art Performance on Speech Editing and Zero-Shot TTS

When textless natural language processing (NLP) first emerged, the core idea was to train a language model on sequences of learnable, discrete units rather than on transcribed text, so that NLP tasks could be applied directly to spoken utterances. Speech editing poses a related challenge: a model must modify individual words or phrases to match a target transcript while leaving the rest of the speech unaltered. Researchers are now exploring a unified model that handles both zero-shot text-to-speech (TTS) and speech editing.

Recent research by the University of Texas at Austin and Rembrand presents VOICECRAFT, a Transformer-based NCLM that generates neural speech codec tokens for infilling, using autoregressive conditioning on bidirectional context. VOICECRAFT achieves state-of-the-art (SotA) results in zero-shot TTS and speech editing. The approach rests on a two-stage token rearrangement procedure made up of a causal masking step and a delayed stacking step. Causal masking, inspired by the success of the causal masked multimodal model in joint text-image modeling, enables autoregressive generation with bidirectional context and is applied here to speech codec sequences.
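The two rearrangement steps can be illustrated with a toy example. The sketch below is a simplified reading of the description above, not the authors' implementation: the function names `causal_mask` and `delay_stack`, the `<M1>`-style mask labels, and the `EMPTY` padding id are all assumptions made for illustration. Causal masking moves the span to be infilled to the end of the sequence, so an autoregressive model sees context from both sides before generating it. Delayed stacking shifts codebook k right by k steps, so the tokens of one frame are predicted over successive steps rather than all at once.

```python
from typing import List, Sequence, Tuple

EMPTY = -1  # hypothetical padding id for positions that hold no token


def causal_mask(tokens: Sequence[object],
                spans: Sequence[Tuple[int, int]]) -> List[object]:
    """Replace each [start, end) span with a mask label and append the
    masked content, preceded by the same label, at the end.

    Spans are assumed to be sorted and non-overlapping.
    """
    kept: List[object] = []
    moved: List[object] = []
    cursor = 0
    for i, (start, end) in enumerate(spans, start=1):
        mask = f"<M{i}>"
        kept.extend(tokens[cursor:start])  # unmasked context before the span
        kept.append(mask)                  # placeholder where the span was
        moved.append(mask)                 # marks which span follows
        moved.extend(tokens[start:end])    # the span itself, now at the end
        cursor = end
    kept.extend(tokens[cursor:])           # unmasked context after the last span
    return kept + moved


def delay_stack(frames: Sequence[Tuple[int, ...]],
                pad: int = EMPTY) -> List[Tuple[int, ...]]:
    """Shift codebook k right by k time steps, padding the gaps."""
    num_frames = len(frames)
    num_codebooks = len(frames[0])
    stacked = []
    for t in range(num_frames + num_codebooks - 1):
        step = tuple(
            frames[t - k][k] if 0 <= t - k < num_frames else pad
            for k in range(num_codebooks)
        )
        stacked.append(step)
    return stacked


if __name__ == "__main__":
    print(causal_mask(["a", "b", "c", "d", "e"], [(1, 3)]))
    # ['a', '<M1>', 'd', 'e', '<M1>', 'b', 'c']
    print(delay_stack([(1, 10), (2, 20), (3, 30)]))
    # [(1, -1), (2, 10), (3, 20), (-1, 30)]
```

In the first call, the span `b, c` is moved behind its right-hand context `d, e`, which is what lets left-to-right generation condition on both sides of the edit. The two steps are shown on separate toy inputs here for clarity.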

To ensure effective multi-codebook modeling, the team combines causal masking with delayed stacking in the proposed token rearrangement. To evaluate speech editing, the team built a realistic and challenging dataset called REALEDIT. It contains 310 real-world speech editing examples, with waveforms ranging from 5 to 12 seconds in duration, collected from audiobooks, YouTube videos, and Spotify podcasts. The target transcripts are produced by editing the source transcripts so that they remain grammatically correct and semantically coherent.

The dataset covers many editing scenarios, including insertion, deletion, substitution, and editing multiple spans at once, with edited text lengths ranging from one word to sixteen words. Because the recordings vary in subject matter, accent, speaking style, recording environment, and background noise, REALEDIT is more challenging than popular speech synthesis evaluation datasets such as VCTK, LJSpeech, and LibriTTS, which consist mostly of read speech such as audiobooks. Its diversity and realism make REALEDIT a good barometer for the real-world applicability of speech editing models.
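To make the editing scenarios concrete, the following sketch compares a source transcript with a target transcript at the word level and lists the spans that differ. It uses Python's standard `difflib` module and is purely illustrative: the helper name `edit_spans` and the example sentences are assumptions, and this is not how the REALEDIT transcripts were produced.

```python
import difflib
from typing import List, Tuple


def edit_spans(source: str, target: str) -> List[Tuple[str, List[str], List[str]]]:
    """Return (operation, source_words, target_words) for each edited span.

    The operation is one of 'replace', 'insert', or 'delete', as reported
    by difflib.SequenceMatcher over whitespace-separated words.
    """
    src_words = source.split()
    tgt_words = target.split()
    matcher = difflib.SequenceMatcher(a=src_words, b=tgt_words, autojunk=False)
    return [
        (tag, src_words[i1:i2], tgt_words[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"  # keep only the spans that actually change
    ]


if __name__ == "__main__":
    print(edit_spans("I like green tea", "I like black tea"))
    print(edit_spans("I like tea", "I really like tea"))
```

Each returned span corresponds to a region of audio that a speech editing model would need to regenerate, while the words outside those spans should keep their original audio.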

Compared with the previous SotA speech editing model on REALEDIT, VOICECRAFT performs far better in subjective human listening tests. Most importantly, VOICECRAFT's edited speech sounds almost identical to the original, unaltered audio. In zero-shot TTS, VOICECRAFT outperforms strong baselines, including a replicated VALL-E and the well-known commercial model XTTS v2, without requiring fine-tuning. The team used audiobooks and YouTube videos in their dataset.

Despite VOICECRAFT's progress, the team highlights some limitations, such as:

The most notable failure during generation is a long stretch of silence followed by a scratching sound. The team worked around this by sampling several utterances and selecting the shorter ones, though more refined and efficient methods should exist.
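The sample-and-select workaround can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' code: `pick_shortest` is a hypothetical helper, and `generate` stands in for whatever function draws one candidate token sequence from the model. The reasoning is that a failed generation with long silences tends to be longer than a clean one, so the shortest candidate is the least likely to contain the artifact.

```python
import random
from typing import Callable, Sequence


def pick_shortest(generate: Callable[[random.Random], Sequence[int]],
                  n_samples: int = 5,
                  seed: int = 0) -> Sequence[int]:
    """Draw n_samples candidates and return the shortest one.

    `generate` is a hypothetical stand-in for a sampling call to the model;
    it receives a seeded random generator so runs are repeatable.
    """
    if n_samples < 1:
        raise ValueError("n_samples must be at least 1")
    rng = random.Random(seed)
    candidates = [generate(rng) for _ in range(n_samples)]
    return min(candidates, key=len)


if __name__ == "__main__":
    # Toy generator: random-length sequences standing in for codec tokens.
    def toy_generate(rng: random.Random) -> Sequence[int]:
        return [0] * rng.randint(50, 200)

    best = pick_shortest(toy_generate, n_samples=5)
    print(len(best))
```

The cost of this heuristic is that every output requires several full generations, which is one reason a more efficient fix would be welcome.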

Another critical AI safety issue is how to watermark and detect synthetic speech. Watermarking and deepfake detection have received considerable attention recently, and significant progress has been made.

However, with the advent of more sophisticated models like VOICECRAFT, the team believes that safety researchers face new opportunities and hurdles. They have made all of their code and model weights publicly available to help with research into AI safety and speech synthesis.

All credit for this research goes to the researchers of this project.

The post VoiceCraft: A Transformer-based Neural Codec Language Model (NCLM) that Achieves State-of-the-Art Performance on Speech Editing and Zero-Shot TTS appeared first on MarkTechPost.
