Multi-modal Large Language Models (MLLMs) are applied to a wide range of visual tasks and rely on visual features extracted from an image to understand its content. When a low-resolution image with fewer pixels is provided as input, the model has less visual information to work with. Because of this limitation, these models often fail to accurately identify the objects, scenes, or actions in an image, which reduces their effectiveness in visual tasks.
Researchers from Shanghai Jiaotong University, Shanghai AI Laboratory, and S-Lab at Nanyang Technological University have introduced MG-LLaVA, a novel MLLM, to address the limitations of current Multi-modal Large Language Models (MLLMs) in processing low-resolution images. The key challenge lies in enabling these models to capture and use high-resolution and object-centric features for improved visual perception and comprehension.
Current MLLMs typically feed concatenated visual and language embeddings into a pre-trained Large Language Model (LLM), with models like LLaVA taking low-resolution images as input. While these models have shown promise, their reliance on low-resolution inputs limits their ability to capture fine-grained details and recognize small objects in complex images. Researchers have proposed various enhancements, including training on diverse datasets, using high-resolution images, and employing dynamic aspect ratios. However, these approaches often lack the integration of object-level features and multi-granularity inputs, which are crucial for comprehensive visual understanding.
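For context, here is a minimal PyTorch sketch of this LLaVA-style input pipeline, in which projected visual tokens are concatenated with text embeddings before the LLM decodes them. The dimensions and module names are illustrative assumptions, not the exact configuration of any specific model.

```python
import torch
import torch.nn as nn

# Minimal sketch of a LLaVA-style input pipeline (hypothetical dimensions):
# visual features from a frozen encoder are projected into the LLM's embedding
# space and concatenated with the text token embeddings before decoding.
class VisionToLLMProjector(nn.Module):
    def __init__(self, vision_dim=1024, llm_dim=4096):
        super().__init__()
        self.proj = nn.Linear(vision_dim, llm_dim)

    def forward(self, visual_feats, text_embeds):
        # visual_feats: (batch, num_patches, vision_dim)
        # text_embeds:  (batch, num_text_tokens, llm_dim)
        visual_tokens = self.proj(visual_feats)
        # Concatenate visual tokens in front of the text tokens,
        # forming the sequence the LLM actually decodes.
        return torch.cat([visual_tokens, text_embeds], dim=1)

projector = VisionToLLMProjector()
fused = projector(torch.randn(1, 576, 1024), torch.randn(1, 32, 4096))
print(fused.shape)  # torch.Size([1, 608, 4096])
```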
The proposed model, MG-LLaVA, is an innovative MLLM that improves visual processing by incorporating a multi-granularity vision flow. This flow combines low-resolution, high-resolution, and object-centric features, enhancing the model’s ability to capture fine-grained details and recognize objects. The MG-LLaVA framework builds on the LLaVA architecture, adding a high-resolution visual encoder, a Conv-Gate fusion network for feature integration, and object-level features derived from bounding boxes identified by open-vocabulary detectors.
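The sketch below outlines how such a multi-granularity vision flow could be composed in PyTorch; the class and submodule names are hypothetical placeholders for the components described above, not the authors' implementation.

```python
import torch
import torch.nn as nn

# High-level sketch of a multi-granularity vision flow in the spirit of MG-LLaVA.
# The submodules (low_res_encoder, high_res_encoder, fusion, roi_extractor)
# are placeholders injected by the caller, not the authors' actual class names.
class MultiGranularityVisionFlow(nn.Module):
    def __init__(self, low_res_encoder, high_res_encoder, fusion, roi_extractor):
        super().__init__()
        self.low_res_encoder = low_res_encoder    # e.g. a CLIP ViT on the low-res image
        self.high_res_encoder = high_res_encoder  # e.g. a CLIP ConvNeXt on the high-res image
        self.fusion = fusion                      # Conv-Gate-style feature fusion
        self.roi_extractor = roi_extractor        # RoI-aligned object-level features

    def forward(self, image_low, image_high, boxes):
        low_feats = self.low_res_encoder(image_low)            # coarse global tokens
        high_feats = self.high_res_encoder(image_high)         # fine-grained features
        fused = self.fusion(low_feats, high_feats)              # merged visual tokens
        object_tokens = self.roi_extractor(high_feats, boxes)   # one token group per box
        # All granularities are concatenated into a single visual token sequence
        # that is later combined with the text embeddings.
        return torch.cat([fused, object_tokens], dim=1)
```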
The MG-LLaVA architecture comprises two key components: the Multi-Granularity Vision Flow framework and a large language model. The Vision Flow framework processes images at different resolutions, using a CLIP-pretrained Vision Transformer (ViT) for low-resolution features and a CLIP-pretrained ConvNeXt for high-resolution features. To fuse these features effectively, the Conv-Gate fusion network aligns the features’ channel widths and modulates semantic information, maintaining computational efficiency.
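Below is a hedged PyTorch sketch of a Conv-Gate-style fusion block. The exact design in the paper may differ; the channel widths are assumed values chosen only to illustrate the two steps described above, aligning channel widths and gating high-resolution detail into the low-resolution tokens.

```python
import torch
import torch.nn as nn

# Illustrative "Conv-Gate" style fusion block (assumed dimensions, not the
# paper's exact design). It aligns the high-res channel width to the low-res
# one, then gates how much high-res detail flows into each token.
class ConvGateFusion(nn.Module):
    def __init__(self, low_dim=1024, high_dim=1536):
        super().__init__()
        # 1x1 convolution aligns the high-res channel width to the low-res one.
        self.align = nn.Conv1d(high_dim, low_dim, kernel_size=1)
        # Gate that decides how much high-res detail to inject per channel/token.
        self.gate = nn.Sequential(
            nn.Conv1d(2 * low_dim, low_dim, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, low_tokens, high_tokens):
        # low_tokens:  (batch, n_tokens, low_dim)  from the CLIP ViT
        # high_tokens: (batch, n_tokens, high_dim) from the CLIP ConvNeXt,
        #              assumed already pooled to the same token count
        low = low_tokens.transpose(1, 2)                 # (batch, low_dim, n)
        high = self.align(high_tokens.transpose(1, 2))   # (batch, low_dim, n)
        g = self.gate(torch.cat([low, high], dim=1))     # (batch, low_dim, n)
        fused = low + g * high                           # gated injection of detail
        return fused.transpose(1, 2)                     # (batch, n, low_dim)

fusion = ConvGateFusion()
out = fusion(torch.randn(2, 576, 1024), torch.randn(2, 576, 1536))
print(out.shape)  # torch.Size([2, 576, 1024])
```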
Object-level features are incorporated using Region of Interest (RoI) alignment to extract detailed features from identified bounding boxes, which are then concatenated with other visual tokens. This multi-granularity approach enhances the model’s ability to capture comprehensive visual details and integrate them with textual embeddings. MG-LLaVA is trained on publicly available multimodal data and fine-tuned with visual instruction tuning data.
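The snippet below illustrates RoI alignment with torchvision's roi_align and how per-box features can be pooled into object tokens; the feature-map shape, box coordinates, and mean pooling are demonstration assumptions, and the actual MG-LLaVA configuration may differ.

```python
import torch
from torchvision.ops import roi_align

# Minimal illustration of RoI alignment for object-level tokens.
# In MG-LLaVA the boxes come from an open-vocabulary detector;
# here they are hard-coded for demonstration.
feature_map = torch.randn(1, 1024, 48, 48)  # high-resolution feature map (B, C, H, W)
boxes = torch.tensor([
    [0.0,  4.0,  4.0, 20.0, 20.0],  # (batch_index, x1, y1, x2, y2) in feature-grid units
    [0.0, 10.0, 12.0, 40.0, 44.0],
])

# Extract a fixed-size feature per box; spatial_scale is 1.0 because the boxes
# above are already expressed in feature-grid coordinates.
object_feats = roi_align(feature_map, boxes, output_size=(7, 7), spatial_scale=1.0)
print(object_feats.shape)  # torch.Size([2, 1024, 7, 7])

# Pool each box to a single token so it can be concatenated with the other
# visual tokens before they are fed to the language model.
object_tokens = object_feats.mean(dim=(2, 3)).unsqueeze(0)  # (1, num_boxes, 1024)
print(object_tokens.shape)  # torch.Size([1, 2, 1024])
```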
Extensive evaluations across multiple benchmarks, including MMBench and SEEDBench, demonstrate that MG-LLaVA outperforms existing MLLMs of comparable parameter sizes. The model significantly improves perception and visual comprehension, surpassing models like GPT-4V and GeminiPro-V. The study also includes comprehensive ablation experiments, confirming the effectiveness of the object-level features and Conv-Gate fusion network.
In conclusion, MG-LLaVA addresses the limitations of current MLLMs by introducing a multi-granularity vision flow that effectively processes low-resolution, high-resolution, and object-centric features. This innovative approach significantly enhances the model’s visual perception and comprehension capabilities, demonstrating superior performance across various multimodal benchmarks.
Check out the Paper and Project. All credit for this research goes to the researchers of this project.