InternLM-XComposer-2.5 (IXC-2.5): A Versatile Large-Vision Language Model that Supports Long-Contextual Input and Output

Large Language Models (LLMs) have made significant strides in recent years, prompting researchers to explore the development of Large Vision Language Models (LVLMs). These models aim to integrate visual and textual information processing capabilities. However, current open-source LVLMs face challenges in matching the versatility of proprietary models like GPT-4, Gemini Pro, and Claude 3. The primary obstacles include limited diversity in training data and difficulties in handling long-context input and output. Researchers are striving to enhance open-source LVLMs' ability to perform a wide range of vision-language comprehension and composition tasks, bridging the gap between open-source and closed-source leading paradigms in terms of versatility and performance across various benchmarks.

Researchers have made significant efforts to tackle the challenges in developing versatile LVLMs. These approaches include text-image conversation models, high-resolution image analysis techniques, and video understanding methods. For text-image conversations, most existing LVLMs focus on single-image multi-round interactions, with some extending to multi-image inputs. High-resolution image analysis has been tackled through two main strategies: high-resolution visual encoders and image patchification. Video understanding in LVLMs has employed techniques such as sparse sampling, temporal pooling, compressed video tokens, and memory banks.
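To make one of the video techniques mentioned above concrete, here is a minimal sketch of sparse sampling: choosing a small, evenly spread set of frames from a clip so that the model only has to encode those. The function name and the midpoint-of-segment rule are illustrative assumptions, not taken from any specific LVLM.

```python
def sample_frame_indices(total_frames: int, num_samples: int) -> list[int]:
    """Pick num_samples frame indices spread evenly across a clip.

    Hypothetical helper for illustration; real models may sample differently.
    """
    if total_frames <= 0 or num_samples <= 0:
        return []
    if num_samples >= total_frames:
        # Clip is shorter than the budget: keep every frame.
        return list(range(total_frames))
    # Split the clip into num_samples equal segments and take each midpoint.
    step = total_frames / num_samples
    return [int(step * i + step / 2) for i in range(num_samples)]


indices = sample_frame_indices(total_frames=300, num_samples=8)
print(indices)
```

Sampling this way trades temporal detail for a fixed token cost per video, which is why it is often combined with pooling or token compression.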

Also, researchers have explored webpage generation, moving from simple UI-to-code transformations to more complex tasks using large vision-language models trained on synthetic datasets. However, these approaches often lack diversity and real-world applicability. To align model outputs with human preferences, techniques like Reinforcement Learning from Human Feedback (RLHF) and Direct Preference Optimization (DPO) have been adapted for multimodal LVLMs, focusing on reducing hallucinations and improving response quality.
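For readers unfamiliar with DPO, the following sketch shows the standard DPO loss for a single preference pair, written with plain Python floats. It is the generic formulation, not necessarily the exact variant used in any multimodal system; the argument names and the default beta value are assumptions chosen for illustration.

```python
import math


def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """Standard DPO loss for one (chosen, rejected) response pair.

    Inputs are summed log-probabilities of each response under the policy
    being trained and under a frozen reference model. beta is illustrative.
    """
    # How much more the policy favors each response than the reference does.
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_margin - rejected_margin)
    # loss = -log(sigmoid(logits)), computed in a numerically stable way.
    if logits >= 0:
        return math.log1p(math.exp(-logits))
    return -logits + math.log1p(math.exp(logits))


# The policy has shifted toward the chosen response relative to the reference,
# so the loss falls below log(2), its value when there is no preference.
print(dpo_loss(-5.0, -12.0, -8.0, -9.0))
```

The loss shrinks as the policy raises the chosen response's likelihood relative to the rejected one, measured against the reference model, without training a separate reward model.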

Researchers from Shanghai Artificial Intelligence Laboratory, The Chinese University of Hong Kong, SenseTime Group, and Tsinghua University have introduced InternLM-XComposer-2.5 (IXC-2.5), an LVLM designed for versatility and long-context input and output. The model handles both comprehension and composition tasks, including free-form text-image conversations, OCR, video understanding, article composition, and webpage crafting. IXC-2.5 supports a 24K interleaved image-text context window, extendable to 96K, enabling long-term human-AI interaction and content creation.

The model introduces three key comprehension upgrades: ultra-high resolution understanding, fine-grained video analysis, and multi-turn multi-image dialogue support. For composition tasks, IXC-2.5 incorporates additional LoRA parameters, enabling webpage creation and high-quality text-image article composition. The latter benefits from Chain-of-Thought and Direct Preference Optimization techniques to enhance content quality.
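As background on what "additional LoRA parameters" means, here is a minimal sketch of a generic LoRA layer in pure Python. A frozen weight matrix W is augmented with a low-rank update B·A, so only the small matrices A and B need training. The article does not specify IXC-2.5's ranks, scaling, or which layers receive adapters, so every name and value below is an illustrative assumption.

```python
def matvec(matrix: list[list[float]], vector: list[float]) -> list[float]:
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def lora_forward(weight, lora_a, lora_b, x, alpha: float = 1.0):
    """Compute W x + (alpha / r) * B (A x) for a generic LoRA layer.

    weight: frozen d_out x d_in matrix
    lora_a: trainable r x d_in matrix
    lora_b: trainable d_out x r matrix (commonly initialized to zeros)
    """
    rank = len(lora_a)
    base = matvec(weight, x)
    update = matvec(lora_b, matvec(lora_a, x))
    scale = alpha / rank
    return [b + scale * u for b, u in zip(base, update)]


weight = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]   # frozen base weights (2 x 3)
lora_a = [[1.0, 1.0, 1.0]]                    # rank-1 down-projection (1 x 3)
x = [1.0, 2.0, 3.0]

# With B at zero the adapter is a no-op, so the base model is unchanged.
print(lora_forward(weight, lora_a, [[0.0], [0.0]], x))
# After training, a non-zero B shifts the output.
print(lora_forward(weight, lora_a, [[0.5], [-0.5]], x))
```

Because the adapters are separate from the base weights, task-specific behavior such as webpage or article generation can be added without retraining the full model.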

IXC-2.5 enhances its predecessors' architecture with a ViT-L/14 Vision Encoder, InternLM2-7B Language Model, and Partial LoRA. It handles diverse inputs through a Unified Dynamic Image Partition strategy, processing images at 560×560 resolution with 400 tokens per sub-image. The model employs a scaled identity strategy for high-resolution images and treats videos as concatenated frames. Multi-image inputs are handled with interleaved formatting. IXC-2.5 also supports audio input/output using Whisper for transcription and MeloTTS for speech synthesis. This versatile architecture enables effective processing of various input types and complex tasks.
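The figures above allow a rough estimate of how many visual tokens an image consumes. The sketch below tiles an image into 560×560 sub-images at 400 tokens each and compares the total against the context window. It is a back-of-the-envelope model only: the extra global-view tile, the absence of resizing or separator tokens, and treating "24K" as 24,000 tokens are all assumptions, not details confirmed by the article.

```python
import math

SUB_IMAGE_SIZE = 560         # pixels per side, from the article
TOKENS_PER_SUB_IMAGE = 400   # from the article
CONTEXT_WINDOW = 24_000      # "24K" read as 24,000 tokens (assumption)


def count_sub_images(width: int, height: int, tile: int = SUB_IMAGE_SIZE) -> int:
    """Number of tile x tile sub-images needed to cover the image."""
    cols = math.ceil(width / tile)
    rows = math.ceil(height / tile)
    return rows * cols


def estimate_image_tokens(width: int, height: int,
                          include_global_view: bool = True) -> int:
    """Rough visual-token estimate for one image.

    include_global_view adds one downscaled overview tile; this is a
    hypothetical option, not a confirmed part of the model.
    """
    tiles = count_sub_images(width, height)
    if include_global_view:
        tiles += 1
    return tiles * TOKENS_PER_SUB_IMAGE


tokens = estimate_image_tokens(1680, 1120)   # 3 x 2 tiles plus 1 overview
print(tokens, "tokens;", CONTEXT_WINDOW // tokens, "such images fit in the window")
```

As a consistency check on the quoted numbers, a 14-pixel patch size gives (560 / 14)² = 1,600 patches per sub-image, so 400 tokens corresponds to a 4× reduction; the article does not say how that reduction is performed.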

IXC-2.5 demonstrates exceptional performance across various benchmarks. In video understanding, it outperforms open-source models in 4 out of 5 benchmarks, matching closed-source APIs. For structural high-resolution tasks, IXC-2.5 competes with larger models, excelling in form and table understanding. It significantly improves multi-image multi-turn comprehension, outperforming previous models by 13.8% on the MMDU benchmark. In general visual QA tasks, IXC-2.5 matches or surpasses both open-source and closed-source models, notably outperforming GPT-4V and Gemini-Pro on some challenges. For screenshot-to-code translation, IXC-2.5 even surpasses GPT-4V in average performance, showcasing its versatility and effectiveness across diverse multimodal tasks.

IXC-2.5 represents a significant advancement in Large Vision-Language Models, offering long-contextual input and output capabilities. This model excels in ultra-high resolution image analysis, fine-grained video comprehension, multi-turn multi-image dialogues, webpage generation, and article composition. Despite utilizing a modest 7B Large Language Model backend, IXC-2.5 demonstrates competitive performance across various benchmarks. This achievement paves the way for future research into more contextual multi-modal environments, potentially extending to long-context video understanding and interaction history analysis. Such advancements promise to enhance AI's capacity to assist humans in diverse real-world applications, marking a crucial step forward in multimodal AI technology.

Check out the Paper and GitHub. All credit for this research goes to the researchers of this project.

The post InternLM-XComposer-2.5 (IXC-2.5): A Versatile Large-Vision Language Model that Supports Long-Contextual Input and Output appeared first on MarkTechPost.
