Close Menu
    DevStackTipsDevStackTips
    • Home
    • News & Updates
      1. Tech & Work
      2. View All

      Sunshine And March Vibes (2025 Wallpapers Edition)

      May 16, 2025

      The Case For Minimal WordPress Setups: A Contrarian View On Theme Frameworks

      May 16, 2025

      How To Fix Largest Contentful Paint Issues With Subpart Analysis

      May 16, 2025

      How To Prevent WordPress SQL Injection Attacks

      May 16, 2025

      Microsoft has closed its “Experience Center” store in Sydney, Australia — as it ramps up a continued digital growth campaign

      May 16, 2025

      Bing Search APIs to be “decommissioned completely” as Microsoft urges developers to use its Azure agentic AI alternative

      May 16, 2025

      Microsoft might kill the Surface Laptop Studio as production is quietly halted

      May 16, 2025

      Minecraft licensing robbed us of this controversial NFL schedule release video

      May 16, 2025
    • Development
      1. Algorithms & Data Structures
      2. Artificial Intelligence
      3. Back-End Development
      4. Databases
      5. Front-End Development
      6. Libraries & Frameworks
      7. Machine Learning
      8. Security
      9. Software Engineering
      10. Tools & IDEs
      11. Web Design
      12. Web Development
      13. Web Security
      14. Programming Languages
        • PHP
        • JavaScript
      Featured

      The power of generators

      May 16, 2025
      Recent

      The power of generators

      May 16, 2025

      Simplify Factory Associations with Laravel’s UseFactory Attribute

      May 16, 2025

      This Week in Laravel: React Native, PhpStorm Junie, and more

      May 16, 2025
    • Operating Systems
      1. Windows
      2. Linux
      3. macOS
      Featured

      Microsoft has closed its “Experience Center” store in Sydney, Australia — as it ramps up a continued digital growth campaign

      May 16, 2025
      Recent

      Microsoft has closed its “Experience Center” store in Sydney, Australia — as it ramps up a continued digital growth campaign

      May 16, 2025

      Bing Search APIs to be “decommissioned completely” as Microsoft urges developers to use its Azure agentic AI alternative

      May 16, 2025

      Microsoft might kill the Surface Laptop Studio as production is quietly halted

      May 16, 2025
    • Learning Resources
      • Books
      • Cheatsheets
      • Tutorials & Guides
    Home»Development»Machine Learning»Kyutai Releases MoshiVis: The First Open-Source Real-Time Speech Model that can Talk About Images

    Kyutai Releases MoshiVis: The First Open-Source Real-Time Speech Model that can Talk About Images

    March 21, 2025

    ​Artificial intelligence has made significant strides in recent years, yet integrating real-time speech interaction with visual content remains a complex challenge. Traditional systems often rely on separate components for voice activity detection, speech recognition, textual dialogue, and text-to-speech synthesis. This segmented approach can introduce delays and may not capture the nuances of human conversation, such as emotions or non-speech sounds. These limitations are particularly evident in applications designed to assist visually impaired individuals, where timely and accurate descriptions of visual scenes are essential.​

    Addressing these challenges, Kyutai has introduced MoshiVis, an open-source Vision Speech Model (VSM) that enables natural, real-time speech interactions about images. Building upon their earlier work with Moshi—a speech-text foundation model designed for real-time dialogue—MoshiVis extends these capabilities to include visual inputs. This enhancement allows users to engage in fluid conversations about visual content, marking a noteworthy advancement in AI development.

    Technically, MoshiVis augments Moshi by integrating lightweight cross-attention modules that infuse visual information from an existing visual encoder into Moshi’s speech token stream. This design ensures that Moshi’s original conversational abilities remain intact while introducing the capacity to process and discuss visual inputs. A gating mechanism within the cross-attention modules enables the model to selectively engage with visual data, maintaining efficiency and responsiveness. Notably, MoshiVis adds approximately 7 milliseconds of latency per inference step on consumer-grade devices, such as a Mac Mini with an M4 Pro Chip, resulting in a total of 55 milliseconds per inference step. This performance stays well below the 80-millisecond threshold for real-time latency, ensuring smooth and natural interactions.

    In practical applications, MoshiVis demonstrates its ability to provide detailed descriptions of visual scenes through natural speech. For instance, when presented with an image depicting green metal structures surrounded by trees and a building with a light brown exterior, MoshiVis articulates:​

    “I see two green metal structures with a mesh top, and they’re surrounded by large trees. In the background, you can see a building with a light brown exterior and a black roof, which appears to be made of stone.”

    This capability opens new avenues for applications such as providing audio descriptions for the visually impaired, enhancing accessibility, and enabling more natural interactions with visual information. By releasing MoshiVis as an open-source project, Kyutai invites the research community and developers to explore and expand upon this technology, fostering innovation in vision-speech models. The availability of the model weights, inference code, and visual speech benchmarks further supports collaborative efforts to refine and diversify the applications of MoshiVis.

    In conclusion, MoshiVis represents a significant advancement in AI, merging visual understanding with real-time speech interaction. Its open-source nature encourages widespread adoption and development, paving the way for more accessible and natural interactions with technology. As AI continues to evolve, innovations like MoshiVis bring us closer to seamless integration of multimodal understanding, enhancing user experiences across various domains.


    Check out the Technical details and Try it here. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter and don’t forget to join our 80k+ ML SubReddit.

    The post Kyutai Releases MoshiVis: The First Open-Source Real-Time Speech Model that can Talk About Images appeared first on MarkTechPost.

    Source: Read More 

    Hostinger
    Facebook Twitter Reddit Email Copy Link
    Previous ArticleUniVG: A Generalist Diffusion Model for Unified Image Generation and Editing
    Next Article NVIDIA AI Open Sources Dynamo: An Open-Source Inference Library for Accelerating and Scaling AI Reasoning Models in AI Factories

    Related Posts

    Machine Learning

    Salesforce AI Releases BLIP3-o: A Fully Open-Source Unified Multimodal Model Built with CLIP Embeddings and Flow Matching for Image Understanding and Generation

    May 16, 2025
    Machine Learning

    How to Evaluate Jailbreak Methods: A Case Study with the StrongREJECT Benchmark

    May 16, 2025
    Leave A Reply Cancel Reply

    Continue Reading

    The Kindle Scribe that easily replaced my Remarkable E Ink tablet is $85 off right now

    News & Updates

    Accelerate digital pathology slide annotation workflows on AWS using H-optimus-0

    Machine Learning

    Laravel Error Solutions on the Default Exception Page

    Development

    Leverage Coroutines in Android with Concurrency Essentials [FREE]

    Development

    Highlights

    Development

    Unveiling Critical Batch Size Dynamics: How Data and Model Scaling Impact Efficiency in Large-Scale Language Model Training with Innovative Optimization Techniques

    November 25, 2024

    Large-scale model training focuses on improving the efficiency and scalability of neural networks, especially in…

    Salesforce Field Encryption to Secure Data

    May 31, 2024

    CVE-2025-4805 – WatchGuard Fireware OS Stored XSS Vulnerability

    May 16, 2025

    How to Better Protect Your Intellectual Property

    February 11, 2025
    © DevStackTips 2025. All rights reserved.
    • Contact
    • Privacy Policy

    Type above and press Enter to search. Press Esc to cancel.