Close Menu
    DevStackTipsDevStackTips
    • Home
    • News & Updates
      1. Tech & Work
      2. View All

      Sunshine And March Vibes (2025 Wallpapers Edition)

      May 16, 2025

      The Case For Minimal WordPress Setups: A Contrarian View On Theme Frameworks

      May 16, 2025

      How To Fix Largest Contentful Paint Issues With Subpart Analysis

      May 16, 2025

      How To Prevent WordPress SQL Injection Attacks

      May 16, 2025

      Microsoft has closed its “Experience Center” store in Sydney, Australia — as it ramps up a continued digital growth campaign

      May 16, 2025

      Bing Search APIs to be “decommissioned completely” as Microsoft urges developers to use its Azure agentic AI alternative

      May 16, 2025

      Microsoft might kill the Surface Laptop Studio as production is quietly halted

      May 16, 2025

      Minecraft licensing robbed us of this controversial NFL schedule release video

      May 16, 2025
    • Development
      1. Algorithms & Data Structures
      2. Artificial Intelligence
      3. Back-End Development
      4. Databases
      5. Front-End Development
      6. Libraries & Frameworks
      7. Machine Learning
      8. Security
      9. Software Engineering
      10. Tools & IDEs
      11. Web Design
      12. Web Development
      13. Web Security
      14. Programming Languages
        • PHP
        • JavaScript
      Featured

      The power of generators

      May 16, 2025
      Recent

      The power of generators

      May 16, 2025

      Simplify Factory Associations with Laravel’s UseFactory Attribute

      May 16, 2025

      This Week in Laravel: React Native, PhpStorm Junie, and more

      May 16, 2025
    • Operating Systems
      1. Windows
      2. Linux
      3. macOS
      Featured

      Microsoft has closed its “Experience Center” store in Sydney, Australia — as it ramps up a continued digital growth campaign

      May 16, 2025
      Recent

      Microsoft has closed its “Experience Center” store in Sydney, Australia — as it ramps up a continued digital growth campaign

      May 16, 2025

      Bing Search APIs to be “decommissioned completely” as Microsoft urges developers to use its Azure agentic AI alternative

      May 16, 2025

      Microsoft might kill the Surface Laptop Studio as production is quietly halted

      May 16, 2025
    • Learning Resources
      • Books
      • Cheatsheets
      • Tutorials & Guides
    Home»Development»Breaking Down Barriers: Scaling Multimodal AI with CuMo

    Breaking Down Barriers: Scaling Multimodal AI with CuMo

    May 14, 2024

    The advent of large language models (LLMs) like GPT-4 has sparked excitement around enhancing them with multimodal capabilities to understand visual data alongside text. However, previous efforts to create powerful multimodal LLMs have faced challenges in scaling up efficiently while maintaining performance. To mitigate these issues, the researchers took inspiration from the mixture-of-experts (MoE) architecture, widely used to scale up LLMs by replacing dense layers with sparse expert modules.

    In the MoE approach, instead of passing inputs through a single large model, there are many smaller expert sub-models that each specialize on a subset of the data. A routing network determines which expert(s) should process each input example. It allows scaling up total model capacity in a more parameter-efficient way.

    In their approach (shown in Figure 2), CuMo, the researchers integrated sparse MoE blocks into the vision encoder and the vision-language connector of a multimodal LLM. This allows different expert modules to process different parts of the visual and text inputs in parallel rather than relying on a monolithic model to analyze everything.

    The key innovation is the concept of co-upcycling (Figure 3). Instead of training the sparse MoE modules from scratch, they are initialized from a pre-trained dense model before being fine-tuned. Co-upcycling provides a better initial point for the experts to specialize during training.  

    For training, CuMo employs a thoughtful three-stage training process:

    1) Pre-train just the vision-language connector on image-text data like LLaVA to align the modalities.

    2) Pre-finetune all model parameters jointly on caption data from ALLaVA to warm up the full system. 

    3) Finally, fine-tune with visual instruction data from datasets like VQAv2, GQA, and LLaVA-Wild, introducing the co-upcycled sparse MoE blocks along with auxiliary losses to balance the expert load and stabilize training. This comprehensive approach, integrating MoE sparsity into multimodal models through co-upcycling and careful training, allows CuMo to scale up efficiently compared to simply increasing model size. 

    The researchers evaluated CuMo models on a range of visual question-answering benchmarks like VQAv2 and GQA, as well as multimodal reasoning challenges such as MMMU and MathVista. Their models, as shown in Figure 1, trained solely on publicly available datasets, outperformed other state-of-the-art approaches within the same model size categories across the board. Even compact 7B parameter CuMo models matched or exceeded the performance of much larger 13B alternatives on many challenging tasks.

    These impressive results highlight the potential of sparse MoE architectures combined with co-upcycling to develop more capable yet efficient multimodal AI assistants. As the researchers have open-sourced their work, CuMo could pave the way for a new generation of AI systems that can seamlessly understand and reason about text, images, and beyond.

    Check out the Paper and GitHub. All credit for this research goes to the researchers of this project. Also, don’t forget to follow us on Twitter. Join our Telegram Channel, Discord Channel, and LinkedIn Group.

    If you like our work, you will love our newsletter..

    Don’t Forget to join our 42k+ ML SubReddit

    The post Breaking Down Barriers: Scaling Multimodal AI with CuMo appeared first on MarkTechPost.

    Source: Read More 

    Hostinger
    Facebook Twitter Reddit Email Copy Link
    Previous ArticleMicrosoft Researchers Introduce Syntheseus: A Machine Learning Benchmarking Python Library for End-to-End Retrosynthetic Planning
    Next Article Vidur: A Large-Scale Simulation Framework Revolutionizing LLM Deployment Through Cost Cuts and Increased Efficiency

    Related Posts

    Security

    Nmap 7.96 Launches with Lightning-Fast DNS and 612 Scripts

    May 16, 2025
    Common Vulnerabilities and Exposures (CVEs)

    CVE-2025-47916 – Invision Community Themeeditor Remote Code Execution

    May 16, 2025
    Leave A Reply Cancel Reply

    Continue Reading

    知乎携手MongoDB为企业数据的安全可靠性保驾护航

    Databases

    A Beginner’s Guide to Observability in Cloud Native Applications

    Development

    CVE-2025-4178 – Xiaowei1118 Java Server Path Traversal Vulnerability

    Common Vulnerabilities and Exposures (CVEs)

    Linux Data Recovery: How to Salvage Lost or Corrupted Files

    Learning Resources

    Highlights

    Linux

    ACS: Il Nuovo Server di Composizione di AMD Basato su Weston

    February 1, 2025

    AMD sembra aver rivolto la sua attenzione al desktop GNU/Linux, con il recente annuncio del…

    Crazy Evil Gang Targets Crypto with StealC, AMOS, and Angel Drainer Malware

    February 3, 2025

    padthv1 is an old-school polyphonic additive synthesizer

    April 10, 2025

    5 essential Linux terms every new user needs to know

    August 19, 2024
    © DevStackTips 2025. All rights reserved.
    • Contact
    • Privacy Policy

    Type above and press Enter to search. Press Esc to cancel.