The advent of large language models (LLMs) like GPT-4 has sparked excitement around enhancing them with multimodal capabilities to understand visual data alongside text. However, previous efforts to create powerful multimodal LLMs have faced challenges in scaling up efficiently while maintaining performance. To mitigate these issues, the researchers took inspiration from the mixture-of-experts (MoE) architecture, widely used to scale up LLMs by replacing dense layers with sparse expert modules.
In the MoE approach, instead of passing every input through one large dense layer, the model contains many smaller expert sub-networks that each specialize in a subset of the data. A routing network decides which expert(s) should process each input. This allows total model capacity to scale in a more parameter-efficient way.
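To make the routing idea concrete, here is a minimal PyTorch sketch of a sparse MoE layer with top-k routing. The class name, expert layout, and dimensions are illustrative assumptions, not CuMo's actual implementation.

```python
# Minimal sketch of a sparse MoE layer with top-k routing (illustrative, not CuMo's code).
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoE(nn.Module):
    def __init__(self, dim, hidden_dim, num_experts=4, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(dim, num_experts)            # routing network
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, hidden_dim), nn.GELU(), nn.Linear(hidden_dim, dim))
            for _ in range(num_experts)
        )

    def forward(self, x):                                    # x: (tokens, dim)
        logits = self.router(x)                              # (tokens, num_experts)
        weights, idx = logits.softmax(-1).topk(self.top_k, dim=-1)
        weights = weights / weights.sum(-1, keepdim=True)    # renormalize the selected weights
        out = torch.zeros_like(x)
        for k in range(self.top_k):                          # combine the chosen experts' outputs
            for e, expert in enumerate(self.experts):
                mask = idx[:, k] == e
                if mask.any():
                    out[mask] += weights[mask, k, None] * expert(x[mask])
        return out
```

Only the selected experts run for a given token, so capacity grows with the number of experts while per-token compute stays roughly constant.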
In their approach, CuMo (shown in Figure 2), the researchers integrated sparse MoE blocks into the vision encoder and the vision-language connector of a multimodal LLM. This lets different expert modules process different parts of the visual and text inputs in parallel rather than relying on a single monolithic module to analyze everything.
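As a rough illustration of how such a block could sit in the vision-language connector, the hypothetical `MoEConnector` below projects vision-encoder features into the LLM embedding space and passes them through the MoE layer sketched above. The module and attribute names are assumptions, not CuMo's code.

```python
# Hypothetical connector built on the SparseMoE sketch above; names are illustrative.
import torch.nn as nn

class MoEConnector(nn.Module):
    """Maps vision-encoder patch features into the LLM embedding space through MoE experts."""
    def __init__(self, vision_dim, llm_dim, num_experts=4, top_k=2):
        super().__init__()
        self.proj = nn.Linear(vision_dim, llm_dim)           # align feature dimensions
        self.moe = SparseMoE(llm_dim, 4 * llm_dim, num_experts, top_k)

    def forward(self, patch_features):                       # (num_patches, vision_dim)
        return self.moe(self.proj(patch_features))           # (num_patches, llm_dim)
```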
The key innovation is the concept of co-upcycling (Figure 3). Instead of training the sparse MoE modules from scratch, they are initialized from a pre-trained dense model and then fine-tuned. Co-upcycling gives the experts a better starting point from which to specialize during training.
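A minimal sketch of that initialization step, assuming each expert shares the layout of the pre-trained dense MLP it replaces (as in the `SparseMoE` experts above); the helper name and router reset are illustrative choices, not the authors' exact procedure.

```python
# Upcycling sketch: every expert starts as a copy of the pre-trained dense MLP,
# so specialization happens during fine-tuning rather than from random initialization.
import torch.nn as nn

def upcycle_from_dense(moe_layer, dense_mlp):
    dense_state = dense_mlp.state_dict()
    for expert in moe_layer.experts:
        expert.load_state_dict(dense_state)                  # identical, pre-trained starting point
    # One common choice: start the router near zero so routing begins roughly uniform.
    nn.init.zeros_(moe_layer.router.weight)
    nn.init.zeros_(moe_layer.router.bias)
```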
For training, CuMo employs a three-stage process:
1) Pre-train just the vision-language connector on image-text data, as in LLaVA, to align the two modalities.
2) Pre-finetune all model parameters jointly on caption data from ALLaVA to warm up the full system.
3) Finally, fine-tune on visual instruction data from datasets such as VQAv2, GQA, and LLaVA-Wild, introducing the co-upcycled sparse MoE blocks along with auxiliary losses that balance the expert load and stabilize training (a sketch of such a loss appears after this list).

This comprehensive approach, integrating MoE sparsity into multimodal models through co-upcycling and careful training, allows CuMo to scale up efficiently compared to simply increasing model size.
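The auxiliary losses mentioned in stage 3 typically include a load-balancing term that encourages tokens to be spread evenly across experts. Below is a hedged sketch in the style of the Switch Transformer balancing loss; CuMo's exact formulation may differ, and the function name is hypothetical.

```python
# Switch-Transformer-style load-balancing loss (sketch). `expert_indices` holds the hard
# top-k routing choices per token; `router_logits` are the raw router outputs.
import torch
import torch.nn.functional as F

def load_balancing_loss(router_logits, expert_indices, num_experts):
    # Fraction of routed tokens assigned to each expert (hard assignments).
    one_hot = F.one_hot(expert_indices, num_experts).float()     # (tokens, top_k, num_experts)
    tokens_per_expert = one_hot.sum(dim=(0, 1)) / one_hot.sum()
    # Mean router probability placed on each expert (soft assignments).
    prob_per_expert = router_logits.softmax(-1).mean(dim=0)
    # The dot product is minimized when both distributions are uniform across experts.
    return num_experts * torch.sum(tokens_per_expert * prob_per_expert)
```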
The researchers evaluated CuMo models on a range of visual question-answering benchmarks such as VQAv2 and GQA, as well as multimodal reasoning challenges such as MMMU and MathVista. As shown in Figure 1, their models, trained solely on publicly available datasets, outperformed other state-of-the-art approaches in the same model-size categories across the board. Even compact 7B-parameter CuMo models matched or exceeded the performance of much larger 13B alternatives on many challenging tasks.
These impressive results highlight the potential of sparse MoE architectures combined with co-upcycling to develop more capable yet efficient multimodal AI assistants. As the researchers have open-sourced their work, CuMo could pave the way for a new generation of AI systems that can seamlessly understand and reason about text, images, and beyond.
Check out the Paper and GitHub. All credit for this research goes to the researchers of this project.