In an announcement that quickly reverberated through the tech world, Kyutai introduced Moshi, a real-time native multimodal foundation model. The model mirrors, and in some respects goes beyond, the capabilities OpenAI showcased with GPT-4o in May.
Moshi is designed to understand and express emotions, and it can speak with different accents, including French. It can listen and generate audio and speech while maintaining a seamless flow of textual "thoughts" alongside its spoken output. One of Moshi's standout features is its ability to handle two audio streams at once, letting it listen and talk at the same time. This real-time interaction is underpinned by joint pre-training on a mix of text and audio, leveraging synthetic text data from Helium, a 7-billion-parameter language model developed by Kyutai.
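To make the full-duplex idea concrete, the sketch below shows one way two audio streams and a text stream could be interleaved into a single token sequence per time step. This is a minimal illustration of the general approach, assuming a per-step pairing of user-audio, Moshi-audio, and text tokens; Kyutai has not published the exact tokenization, so treat the structure as an assumption.

```python
# Minimal sketch of full-duplex, two-stream interleaving (an illustrative
# assumption about the general idea, not Kyutai's actual tokenization):
# at every time step the model reads a token from the user's audio stream and
# emits both a token for its own audio stream and a text "thought" token.
from dataclasses import dataclass

@dataclass
class Step:
    user_audio_token: int   # codec token heard from the user (input stream)
    moshi_audio_token: int  # codec token Moshi speaks (output stream)
    text_token: int         # textual "thought" aligned with the speech

def interleave(steps):
    """Flatten the per-step channels into one sequence a decoder-only LM could model."""
    seq = []
    for s in steps:
        seq.extend([s.user_audio_token, s.moshi_audio_token, s.text_token])
    return seq

print(interleave([Step(11, 42, 7), Step(12, 43, 8)]))  # -> [11, 42, 7, 12, 43, 8]
```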
The fine-tuning process of Moshi involved 100,000 “oral-style” synthetic conversations, converted using Text-to-Speech (TTS) technology. The model’s voice was trained on synthetic data generated by a separate TTS model, achieving an impressive end-to-end latency of 200 milliseconds. Remarkably, Kyutai has also developed a smaller variant of Moshi that can run on a MacBook or a consumer-sized GPU, making it accessible to a broader range of users.
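To give a feel for what a 200 ms end-to-end figure means in practice, the snippet below lays out a hypothetical latency budget that sums to roughly that number. Every stage name and value here is an illustrative assumption, not a breakdown published by Kyutai.

```python
# Hypothetical latency budget summing to roughly the reported 200 ms.
# All stage names and numbers are illustrative assumptions.
budget_ms = {
    "audio frame accumulation": 80,
    "speech encoding": 10,
    "language-model step": 80,
    "speech decoding": 10,
    "playback buffering": 20,
}
for stage, ms in budget_ms.items():
    print(f"{stage:28s}{ms:4d} ms")
print(f"{'total':28s}{sum(budget_ms.values()):4d} ms")  # 200 ms
```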
Kyutai has emphasized the importance of responsible AI use by incorporating watermarking to detect AI-generated audio, a feature that is currently a work in progress. The decision to release Moshi as an open-source project highlights Kyutai’s commitment to transparency and collaborative development within the AI community.
At its core, Moshi is powered by a 7-billion-parameter multimodal language model that processes speech input and output. The model operates with a two-channel I/O system, generating text tokens and audio codec tokens concurrently. The base text language model, Helium 7B, was trained from scratch and then jointly trained on text and audio codec tokens. The speech codec, based on Kyutai’s in-house Mimi model, achieves a 300x compression factor while capturing both semantic and acoustic information.
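As a rough sanity check on that figure, the snippet below works out what a 300x compression factor implies for bitrate, assuming 24 kHz, 16-bit mono PCM as the uncompressed reference. The sample rate and bit depth are illustrative assumptions, not published specifications; only the 300x factor comes from the announcement.

```python
# Back-of-the-envelope arithmetic for the stated ~300x compression factor.
# Sample rate and bit depth are assumptions chosen for illustration.
sample_rate_hz = 24_000
bits_per_sample = 16
raw_bitrate_kbps = sample_rate_hz * bits_per_sample / 1_000   # 384 kbps uncompressed PCM
compression_factor = 300                                      # figure stated in the article
codec_bitrate_kbps = raw_bitrate_kbps / compression_factor    # ~1.3 kbps after the codec
print(f"raw PCM: {raw_bitrate_kbps:.0f} kbps -> codec: {codec_bitrate_kbps:.2f} kbps")
```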
Training Moshi involved rigorous processes, including fine-tuning on 100,000 highly detailed transcripts annotated with emotion and style. The Text-to-Speech engine, which supports 70 different emotions and styles, was fine-tuned on 20 hours of audio recorded by a licensed voice talent named Alice. The model is designed for adaptability and can be fine-tuned with less than 30 minutes of audio.
Moshi’s deployment showcases its efficiency. The demo model, hosted on the Scaleway and Hugging Face platforms, can handle a batch size of two within 24 GB of VRAM. It supports various backends, including CUDA, Metal, and CPU, and benefits from inference code written in Rust. Enhanced KV caching and prompt caching are anticipated to improve performance further.
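For readers unfamiliar with the optimization being referred to, the toy example below sketches the general idea of KV caching in autoregressive decoding: keys and values from earlier steps are stored so each new token attends over cached state instead of recomputing it. This is a minimal NumPy illustration of the technique in general, not Kyutai's Rust implementation.

```python
# Toy illustration of key-value (KV) caching for autoregressive decoding.
import numpy as np

class KVCache:
    def __init__(self):
        self.keys, self.values = [], []

    def append(self, k, v):
        # Store this step's key/value projections so earlier tokens are never recomputed.
        self.keys.append(k)
        self.values.append(v)

    def attend(self, q):
        K = np.stack(self.keys)            # (steps, dim)
        V = np.stack(self.values)          # (steps, dim)
        scores = K @ q / np.sqrt(q.size)   # attention logits over cached steps
        weights = np.exp(scores - scores.max())
        weights /= weights.sum()
        return weights @ V                 # context vector for the new token

cache, dim = KVCache(), 8
rng = np.random.default_rng(0)
for _ in range(5):                         # one decode step per generated token
    q, k, v = rng.standard_normal(dim), rng.standard_normal(dim), rng.standard_normal(dim)
    cache.append(k, v)
    context = cache.attend(q)              # attends over all cached steps so far
print("cached steps:", len(cache.keys), "context shape:", context.shape)
```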
Looking ahead, Kyutai has ambitious plans for Moshi. The team intends to release a technical report and open model versions, including the inference codebase, the 7B model, the audio codec, and the full optimized stack. Future iterations, such as Moshi 1.1, 1.2, and 2.0, will refine the model based on user feedback. Moshi’s licensing aims to be as permissive as possible, fostering widespread adoption and innovation.
In conclusion, Moshi exemplifies the potential of small, focused teams to achieve extraordinary advancements in AI technology. This model opens up new avenues for research assistance, brainstorming, language learning, and more, demonstrating the transformative power of AI when deployed on-device with unparalleled flexibility. As an open-source model, it invites collaboration and innovation, ensuring that the benefits of this groundbreaking technology are accessible to all.
Check out the Announcement, Keynote, and Demo Chat. All credit for this research goes to the researchers of this project.
Paper, Code, and Model are coming…