Camb AI Releases MARS5 TTS: A Novel Open Source Text to Speech Model for Insane Prosody

Camb AI has released MARS5 TTS, an open-source text-to-speech model. It offers fine-grained prosodic control and voice cloning from less than 5 seconds of reference audio. The system uses a two-stage architecture consisting of a 750M-parameter autoregressive (AR) model and a 450M-parameter non-autoregressive (NAR) model. MARS5 uses a byte-pair encoding (BPE) tokenizer, which lets punctuation in the input text steer pauses and stops.

The model’s architecture follows a two-stage AR-NAR pipeline. In the first stage, an autoregressive transformer generates coarse (L0) EnCodec speech features from the input text and reference audio. In the second stage, a multinomial Denoising Diffusion Probabilistic Model (DDPM) refines these features, conditioned on the same text and reference, to produce the remaining EnCodec codebook values. Finally, a vocoder converts the DDPM output into the final audio.
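The data flow of the two stages can be sketched in a few lines of Python. This is only an illustration of which stage produces which codes, not the MARS5 implementation: the function names are hypothetical, the "models" are random placeholders, and the codebook count and size are assumptions that the article does not state.

```python
import random
from typing import List, Tuple

NUM_CODEBOOKS = 8     # assumption for illustration; not given in the article
CODEBOOK_SIZE = 1024  # assumption for illustration; not given in the article


def ar_stage(text: str, ref_codes: List[List[int]], n_frames: int,
             rng: random.Random) -> List[int]:
    """Stand-in for the AR transformer: one coarse (L0) code per frame."""
    return [rng.randrange(CODEBOOK_SIZE) for _ in range(n_frames)]


def nar_stage(text: str, ref_codes: List[List[int]], l0_codes: List[int],
              rng: random.Random) -> List[List[int]]:
    """Stand-in for the NAR DDPM: keeps L0 fixed, fills the other codebooks."""
    frames = []
    for c0 in l0_codes:
        rest = [rng.randrange(CODEBOOK_SIZE) for _ in range(NUM_CODEBOOKS - 1)]
        frames.append([c0] + rest)
    return frames


def vocoder(codes: List[List[int]]) -> List[float]:
    """Stand-in vocoder: maps each frame of codes to one fake audio sample."""
    return [sum(frame) / (NUM_CODEBOOKS * CODEBOOK_SIZE) for frame in codes]


def synthesize(text: str, ref_codes: List[List[int]], n_frames: int = 10,
               seed: int = 0) -> Tuple[List[int], List[List[int]], List[float]]:
    """Run the three steps in order: AR (L0) -> NAR (remaining codes) -> vocoder."""
    rng = random.Random(seed)
    l0 = ar_stage(text, ref_codes, n_frames, rng)
    full = nar_stage(text, ref_codes, l0, rng)
    audio = vocoder(full)
    return l0, full, audio


if __name__ == "__main__":
    reference = [[0] * NUM_CODEBOOKS for _ in range(5)]  # dummy reference codes
    l0, full, audio = synthesize("Hello there.", reference)
    print(len(l0), len(full), len(full[0]), len(audio))
```

The point of the sketch is the ordering: the second stage receives the coarse codes from the first and only adds the remaining codebooks, so the L0 sequence passes through unchanged.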

Because the model is trained on raw audio together with byte-pair-encoded text, punctuation and capitalization give nuanced control over prosody. For instance, adding commas introduces pauses, while capitalizing a word emphasizes it, which offers a natural way to guide the prosody of the generated output.
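As a small illustration of steering prosody through the text itself, the helpers below rewrite an input string before it would be sent to the model. They are hypothetical utilities written for this article, not part of MARS5, and they only prepare text; how strongly the model reacts to the changes is not something this sketch can show.

```python
import re
from typing import Set


def emphasize(text: str, words: Set[str]) -> str:
    """Upper-case every whole word whose lower-case form is in `words`."""
    def repl(match: "re.Match[str]") -> str:
        word = match.group(0)
        return word.upper() if word.lower() in words else word
    return re.sub(r"[A-Za-z']+", repl, text)


def add_pause_after(text: str, word: str) -> str:
    """Insert a comma after the first occurrence of `word` that is not
    already followed by punctuation."""
    pattern = re.compile(rf"\b{re.escape(word)}\b(?![,.;:!?])", re.IGNORECASE)
    return pattern.sub(lambda m: m.group(0) + ",", text, count=1)


if __name__ == "__main__":
    plain = "Well I never said that"
    print(add_pause_after(plain, "well"))
    print(emphasize(plain, {"never"}))
```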

Unlike general-purpose language models such as GPT and Gemini, MARS5 focuses specifically on text-to-speech synthesis. While GPT and Gemini are primarily designed for text generation and understanding, MARS5 is optimized for producing high-quality, controllable speech output. Its use of a DDPM in the NAR stage and its prosodic control through text formatting set it apart in speech synthesis.

MARS5 demonstrates impressive results in voice cloning and prosodic control. The system supports two inference modes: a fast “shallow clone” that doesn’t require the reference audio’s transcript, and a slower but higher-quality “deep clone” that uses the prompt transcript. With just 5 seconds of audio and a text snippet, MARS5 can generate speech for diverse and challenging scenarios, including sports commentary and anime voiceovers.

To use MARS5, provide a reference audio file between 2 and 12 seconds long; samples of around 6 seconds yield the best results. The system accepts text input with punctuation and capitalization for prosodic control. Users can perform a “deep clone” for higher quality by also providing the reference audio’s transcript, though this takes longer. MARS5’s ability to handle complex prosodic scenarios makes it suitable for applications in entertainment, education, and accessibility.
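The input requirements above can be checked before any model is loaded. The following sketch uses only the Python standard library to measure a WAV file's duration, reject references outside the 2 to 12 second range, and pick a clone mode depending on whether a transcript is available. The function names, the mode strings, and the 24 kHz sample rate of the test clip are assumptions made for illustration; this is not the MARS5 API.

```python
import io
import wave
from typing import Optional

MIN_REF_SECONDS = 2.0   # lower bound quoted in the article
MAX_REF_SECONDS = 12.0  # upper bound quoted in the article


def wav_duration_seconds(wav_file) -> float:
    """Duration of a WAV file (path or binary file-like object) in seconds."""
    with wave.open(wav_file, "rb") as w:
        return w.getnframes() / w.getframerate()


def choose_clone_mode(duration: float, transcript: Optional[str]) -> str:
    """Return 'deep' if a non-empty transcript is given, else 'shallow'.

    Raises ValueError if the reference is outside the 2-12 second range.
    """
    if not (MIN_REF_SECONDS <= duration <= MAX_REF_SECONDS):
        raise ValueError(
            f"reference is {duration:.2f}s; expected "
            f"{MIN_REF_SECONDS:.0f}-{MAX_REF_SECONDS:.0f}s"
        )
    return "deep" if transcript and transcript.strip() else "shallow"


def make_silent_wav(seconds: float, sample_rate: int = 24000) -> io.BytesIO:
    """Build an in-memory mono 16-bit silent WAV, used here only for testing."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as w:
        w.setnchannels(1)
        w.setsampwidth(2)
        w.setframerate(sample_rate)
        w.writeframes(b"\x00\x00" * int(seconds * sample_rate))
    buf.seek(0)
    return buf


if __name__ == "__main__":
    duration = wav_duration_seconds(make_silent_wav(6.0))
    print(duration, choose_clone_mode(duration, None))
    print(choose_clone_mode(duration, "Transcript of the reference clip."))
```

Failing early on a clip that is too short or too long is cheaper than discovering the problem after a slow deep-clone run.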

MARS5 TTS represents a significant advancement in open-source text-to-speech technology. Its architecture, which combines AR and NAR models with a DDPM, enables fine-grained control over speech synthesis. The system’s ability to clone voices from minimal input and generate high-quality, prosodically rich speech makes it a valuable tool for developers and researchers working on speech technology.

Check out the Model and GitHub. All credit for this research goes to the researchers of this project.

The post Camb AI Releases MARS5 TTS: A Novel Open Source Text to Speech Model for Insane Prosody appeared first on MarkTechPost.
