When textless natural language processing (NLP) first emerged, the core idea was to train a language model on sequences of learnable, discrete units rather than on transcribed text, so that NLP tasks could be applied directly to spoken utterances. Speech editing adds a further requirement: a model must modify individual words or phrases to match a target transcript while leaving the rest of the speech untouched. Researchers are now exploring a unified model that handles both zero-shot text-to-speech (TTS) and speech editing, a significant step forward for the field.
Recent research from the University of Texas at Austin and Rembrand presents VOICECRAFT, a Transformer-based neural codec language model (NCLM) that generates neural speech codec tokens for infilling by conditioning autoregressively on bidirectional context. VOICECRAFT achieves state-of-the-art (SotA) results on both zero-shot TTS and speech editing. The approach rests on a two-stage token rearrangement procedure consisting of a causal masking step and a delayed stacking step. Causal masking, inspired by the success of the causal masked multimodal model in joint text-image modeling, enables autoregressive generation with bidirectional context and is applied here to speech codec sequences.
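To make the rearrangement concrete, here is a minimal Python sketch of the two steps. This is not the authors' implementation: the sentinel names `<M1>`, `<EOS>`, and `<E>`, the single-span simplification, and the plain-list representation of codec tokens are all assumptions made for illustration.

```python
import random

MASK = "<M1>"   # hypothetical mask sentinel left where a span is cut out
EOS = "<EOS>"   # hypothetical end-of-span token
EMPTY = "<E>"   # hypothetical filler token used by delayed stacking

def causal_mask(tokens, span_len):
    """Cut a contiguous span, leave a mask sentinel in its place, and
    append the span at the end. An autoregressive model then predicts
    the span conditioned on both its prefix and its suffix, i.e. on
    bidirectional context. Simplified here to a single masked span."""
    start = random.randrange(0, len(tokens) - span_len + 1)
    span = tokens[start:start + span_len]
    rearranged = tokens[:start] + [MASK] + tokens[start + span_len:]
    return rearranged + [MASK] + span + [EOS]

def delayed_stack(frames, num_codebooks):
    """Offset codebook k by k time steps so the K codebook tokens of a
    single frame are spread across successive autoregressive steps.
    `frames` is a list of T frames, each a list of K codec token ids."""
    T = len(frames)
    stacked = []
    for t in range(T + num_codebooks - 1):
        step = []
        for k in range(num_codebooks):
            src = t - k  # codebook k lags k steps behind codebook 0
            step.append(frames[src][k] if 0 <= src < T else EMPTY)
        stacked.append(step)
    return stacked
```

With K = 4 codebooks, for example, codebook 3 of frame 0 is emitted at step 3 alongside codebook 0 of frame 3, so no single step has to predict all codebooks of one frame at once.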
To ensure effective multi-codebook modeling, the team combines causal masking with delayed stacking in its proposed token rearrangement scheme. To evaluate speech editing, the team also built REALEDIT, a unique, realistic, and challenging dataset. REALEDIT contains 310 real-world speech editing examples, with waveforms ranging from 5 to 12 seconds in duration, collected from audiobooks, YouTube videos, and Spotify podcasts. The target transcripts are produced by editing the source transcripts while preserving grammatical correctness and semantic coherence.
The dataset is structured to cover a wide range of editing scenarios, including insertion, deletion, substitution, and editing of multiple spans at once, with edited text lengths ranging from one word to sixteen words. Because its recordings vary in subject matter, accent, speaking style, recording environment, and background noise, REALEDIT is considerably more challenging than popular speech synthesis evaluation datasets such as VCTK, LJSpeech, and LibriTTS, which consist largely of clean, audiobook-style recordings. This diversity and realism make REALEDIT a good barometer of the real-world applicability of speech editing models.
In subjective human listening tests on REALEDIT, VOICECRAFT performs far better than the previous SotA speech editing model. Most importantly, VOICECRAFT's edited speech sounds almost identical to the original, unaltered audio. On zero-shot TTS, the results show that VOICECRAFT outperforms strong baselines, including a replicated VALL-E and the well-known commercial model XTTS v2, without requiring any fine-tuning. The team's training data consists of audiobooks and YouTube videos.
Despite VOICECRAFT’s progress, the team highlights several limitations:
The most notable failure mode during generation is long stretches of silence followed by a scratching sound. In this study, the team mitigated the issue by sampling multiple utterances and selecting the shorter ones, but more refined and efficient remedies should be possible (see the sketch after this list).
Another critical issue for AI safety is how to watermark and detect synthetic speech. Watermarking and deepfake detection have received much attention recently, and great strides have been made.
However, with the advent of more sophisticated models like VOICECRAFT, the team believes safety researchers face new opportunities as well as new hurdles. To support research into AI safety and speech synthesis, they have made all of their code and model weights publicly available.
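As a rough illustration of the length-based workaround mentioned above, the sketch below draws several candidate generations and keeps the shortest, since the silence-and-scratching failure mode inflates utterance length. Here `model.generate` and `prompt` are hypothetical stand-ins, not the released VOICECRAFT API.

```python
def pick_shortest_generation(model, prompt, num_samples=5):
    """Sample several candidate utterances and return the shortest.
    Rationale: the long-silence/scratching failure mode makes bad
    generations abnormally long, so length is a cheap rejection signal.
    `model.generate` is a hypothetical sampling call, not a real API."""
    candidates = [model.generate(prompt) for _ in range(num_samples)]
    return min(candidates, key=len)
```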
Check out the Paper and GitHub. All credit for this research goes to the researchers of this project.