NVIDIA AI Introduces MM-Embed: The First Multimodal Retriever Achieving SOTA Results on the Multimodal M-BEIR Benchmark

In information retrieval, one of the hardest problems is building a system that can understand and retrieve relevant content across formats such as text and images without losing accuracy. Most state-of-the-art retrieval models are still confined to a single modality (either text-to-text or image-to-image retrieval), which limits their applicability in real-world scenarios where information comes in diverse formats. This limitation is particularly evident in applications such as visual question answering or fashion image retrieval, where both text and images are needed to derive relevant answers. A universal multimodal retriever that handles text, images, and their combinations is therefore needed. The key challenges are the inherent difficulty of cross-modal understanding and the biases within individual modalities.

NVIDIA researchers address these challenges with MM-Embed, the first multimodal retriever to achieve state-of-the-art (SOTA) results on the multimodal M-BEIR benchmark while also ranking among the top five retrievers on the text-only MTEB retrieval benchmark. MM-Embed aims to bridge retrieval formats so that a single search can span both text and image content. To build it, the researchers fine-tuned a multimodal large language model (MLLM) as a bi-encoder retriever on 16 retrieval tasks spanning ten datasets. Unlike retrievers restricted to a single type of data, MM-Embed supports user queries composed of text, images, or both. In addition, modality-aware hard negative mining improves MM-Embed's retrieval quality by reducing the modality bias commonly seen in MLLM-based retrievers.
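To make the bi-encoder idea concrete, here is a minimal sketch of bi-encoder retrieval in plain Python. It is not MM-Embed's code: the three-dimensional vectors are made-up stand-ins for the embeddings an MLLM encoder would produce, and the names `build_index` and `retrieve` are hypothetical. The point it illustrates is that candidates of any modality are encoded once into a shared space, and a query is matched against them by similarity alone.

```python
import math


def normalize(vec):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def cosine(a, b):
    """Cosine similarity of two vectors that are already unit length."""
    return sum(x * y for x, y in zip(a, b))


def build_index(candidates):
    """Encode every candidate once, offline.

    In a real system this step would call the encoder; here the toy vectors
    are only normalized.
    """
    return {cid: normalize(vec) for cid, vec in candidates.items()}


def retrieve(query_vec, index, k=2):
    """Rank all candidates by similarity to the query and return the top k ids."""
    q = normalize(query_vec)
    ranked = sorted(index.items(), key=lambda item: cosine(q, item[1]), reverse=True)
    return [cid for cid, _ in ranked[:k]]


# Hypothetical embeddings: two images and one text passage in the same space.
CANDIDATES = {
    "img_red_dress": [0.9, 0.1, 0.0],
    "txt_wiki_paris": [0.0, 0.2, 0.9],
    "img_blue_dress": [0.7, 0.6, 0.1],
}
INDEX = build_index(CANDIDATES)

# Hypothetical embedding of a mixed query (reference image plus a text edit).
top = retrieve([1.0, 0.2, 0.0], INDEX, k=2)
print(top)
```

Because the query and candidates never pass through the encoder together, the candidate index can be built ahead of time, which is the usual reason to prefer a bi-encoder for first-stage retrieval.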

The technical implementation of MM-Embed combines several strategies. The model is a bi-encoder: queries and candidates are encoded separately into a shared embedding space and matched by similarity. During fine-tuning, modality-aware hard negative mining mitigates the biases that arise with mixed-modality data. In simple terms, the mined negatives teach the model to respect the target modality (text, image, or a combination), which improves its handling of difficult, interleaved text-image queries. MM-Embed then undergoes continual fine-tuning to boost its text retrieval capability without sacrificing its strength on multimodal tasks. The result is effective across diverse scenarios, from retrieving Wikipedia paragraphs in response to a text question about an image to finding similar images based on complex descriptions.
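The following sketch shows one plausible reading of modality-aware hard negative mining. It is an illustration under stated assumptions, not the authors' exact procedure: the selection rules, the `skip_top` and `per_type` parameters, and all names are hypothetical. Given a first-stage ranking for a training query, it collects two kinds of negatives: candidates in the wrong modality that outranked the labeled positive, and candidates in the right modality that are not the positive.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    cid: str
    modality: str  # e.g. "text" or "image"


def mine_hard_negatives(ranked, positive_id, target_modality, skip_top=1, per_type=2):
    """Pick two kinds of negatives from a ranked candidate list (assumed rules).

    wrong_modality: candidates ranked above the positive whose modality differs
        from the one the task asks for.
    right_modality: candidates with the requested modality that are not the
        positive; the first `skip_top` ranks are skipped as a guard against
        unlabeled true positives.
    """
    # If the positive was not retrieved at all, treat every candidate as outranking it.
    positive_rank = next(
        (i for i, c in enumerate(ranked) if c.cid == positive_id), len(ranked)
    )
    wrong_modality = [
        c.cid for c in ranked[:positive_rank] if c.modality != target_modality
    ]
    right_modality = [
        c.cid
        for i, c in enumerate(ranked)
        if i >= skip_top and c.modality == target_modality and c.cid != positive_id
    ]
    return wrong_modality[:per_type], right_modality[:per_type]


# Toy ranking for a query whose answer should be an image.
RANKED = [
    Candidate("t1", "text"),
    Candidate("i1", "image"),
    Candidate("t2", "text"),
    Candidate("i_pos", "image"),  # the labeled positive, ranked fourth
    Candidate("i2", "image"),
    Candidate("t3", "text"),
    Candidate("i3", "image"),
]

wrong, right = mine_hard_negatives(RANKED, "i_pos", "image", skip_top=2)
print(wrong, right)
```

In this toy ranking the text candidates `t1` and `t2` outrank the image positive, so they become wrong-modality negatives; training against them is what would push the retriever to honor the requested modality.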

This advancement is significant for several reasons. First, MM-Embed sets a new state of the art for multimodal retrieval with an average retrieval accuracy of 52.7% across all M-BEIR tasks, surpassing previous state-of-the-art models. On individual datasets it also showed notable improvements, such as a Recall@5 (R@5) of 73.8% on MSCOCO, indicating a strong ability to match images with complex captions. Moreover, zero-shot reranking with multimodal LLMs further improved retrieval precision on intricate text-image queries, such as visual question answering and composed image retrieval. Notably, reranking improved ranking accuracy on CIRCO's composed image retrieval task by more than 7 points, showing the value of prompting LLMs for reranking in challenging, real-world scenarios.
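For readers unfamiliar with the metric, the sketch below computes Recall@k for queries with a single labeled positive and shows how a reranking pass over the top results can change it. The data, the `rerank` helper, and the `TOY_SCORES` table are all hypothetical; in particular, the scores stand in for whatever relevance signal a prompted multimodal LLM would provide, and nothing here reproduces the reported numbers.

```python
def recall_at_k(rankings, positives, k=5):
    """Fraction of queries whose labeled positive appears in the top k results."""
    hits = sum(1 for qid, ranked in rankings.items() if positives[qid] in ranked[:k])
    return hits / len(rankings)


def rerank(ranked, score_fn, top_k=10):
    """Reorder only the first `top_k` candidates by `score_fn`, highest first.

    Candidates beyond `top_k` keep their first-stage order.
    """
    head = sorted(ranked[:top_k], key=score_fn, reverse=True)
    return head + ranked[top_k:]


# Toy first-stage rankings and labels (hypothetical ids).
RANKINGS = {"q1": ["a", "b", "c"], "q2": ["d", "e", "f"]}
POSITIVES = {"q1": "b", "q2": "f"}

# Stand-in for reranker relevance scores, e.g. derived from an LLM's answer
# to a yes/no relevance prompt. These values are invented for the example.
TOY_SCORES = {"a": 0.1, "b": 0.9, "c": 0.3, "d": 0.2, "e": 0.1, "f": 0.8}

before = recall_at_k(RANKINGS, POSITIVES, k=1)
reranked = {
    qid: rerank(ranked, lambda cid: TOY_SCORES[cid], top_k=3)
    for qid, ranked in RANKINGS.items()
}
after = recall_at_k(reranked, POSITIVES, k=1)
print(before, after)
```

Reranking only reorders what the retriever already returned, so it can raise Recall@1 in this example but cannot raise recall at any cutoff at or beyond `top_k`.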

In conclusion, MM-Embed represents a substantial step forward in multimodal retrieval. By integrating and strengthening both text and image retrieval in a single model, it paves the way for more versatile search engines that can handle the varied ways people seek information today.

Check out the Paper and Model on Hugging Face. All credit for this research goes to the researchers of this project.

The post NVIDIA AI Introduces MM-Embed: The First Multimodal Retriever Achieving SOTA Results on the Multimodal M-BEIR Benchmark appeared first on MarkTechPost.
