
    LLaVA-NeXT-Interleave: A Versatile Large Multimodal Model LMM that can Handle Settings like Multi-image, Multi-frame, and Multi-view

    July 13, 2024

    Recent progress in Large Multimodal Models (LMMs) has demonstrated remarkable capabilities across multimodal settings, moving closer to the goal of artificial general intelligence. These models give LLMs visual abilities by aligning vision encoders with the language model using large amounts of vision-language data. However, most open-source LMMs have focused on single-image scenarios, leaving the more complex multi-image scenarios largely unexplored. This matters because many real-world applications depend on reasoning over several images at once. Given the wide range of computer vision scenarios and data types, there is a strong need for a general LMM framework that works effectively with multi-image, video, and 3D data.

    The paper situates this problem within three lines of related work. The first is interleaved image-text data, which gives LMMs two key abilities: multimodal in-context learning (ICL) and instruction following in real-world multi-image scenarios. The second is interleaved LMMs: closed-source models such as GPT-4V and Gemini support real-world multi-image applications with top performance, and the community has also built open-source LMMs with strong multi-image skills using diverse public datasets. The third is interleaved benchmarks: several high-quality benchmarks have been developed to evaluate the multi-image abilities of LMMs across various scenarios.

    Researchers from ByteDance, HKUST, CUHK, and NTU have proposed LLaVA-NeXT-Interleave, a versatile LMM that handles real-world settings such as Multi-image, Multi-frame (video), and Multi-view (3D) while maintaining Multi-patch (single-image) performance. These four settings are collectively called M4. To give LMMs the M4 capabilities, the researchers built a high-quality training dataset, M4-Instruct, with 1,177.6k samples covering 14 tasks and 41 datasets across the four domains. Using a single model, LLaVA-NeXT-Interleave achieves top results on a range of multi-image tasks compared with previous state-of-the-art models, while still performing well on single images.
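
    One way to picture how a single model can cover the multi-image, multi-frame, and multi-view settings is to treat every input as one sequence in which images and text are interleaved. The Python sketch below only illustrates that idea and is not the authors' code: the `<image>` placeholder string, the `Image` class, and the `build_prompt` helper are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import List, Tuple, Union

# Hypothetical placeholder marking where an image sits in the text stream.
IMAGE_TOKEN = "<image>"


@dataclass
class Image:
    """Illustrative stand-in for an image, video frame, or 3D view."""
    source: str


# A segment is either a piece of text or an image.
Segment = Union[str, Image]


def build_prompt(segments: List[Segment]) -> Tuple[str, List[Image]]:
    """Flatten interleaved segments into a prompt string plus an ordered image list.

    Each image is replaced by IMAGE_TOKEN so the text keeps track of
    where every image belongs relative to the surrounding words.
    """
    parts: List[str] = []
    images: List[Image] = []
    for seg in segments:
        if isinstance(seg, Image):
            parts.append(IMAGE_TOKEN)
            images.append(seg)
        else:
            parts.append(seg)
    return " ".join(parts), images


# Multi-image: two photos compared in one question.
multi_image = [Image("a.jpg"), "and", Image("b.jpg"), "What changed?"]

# Multi-frame (video): sampled frames followed by a question.
multi_frame = [Image(f"frame_{i}.jpg") for i in range(4)] + ["Describe the video."]

# Multi-view (3D): several views of one scene followed by a question.
multi_view = [Image(f"view_{i}.jpg") for i in range(3)] + ["How many chairs are there?"]

prompt, images = build_prompt(multi_image)
print(prompt)
```

    In this sketch the three settings differ only in how many images appear and where they fall in the sequence, which is why one input format can serve all of them.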

    LLaVA-NeXT-Interleave is evaluated across the M4 settings. For multi-image evaluation, the LLaVA-Interleave Bench is used to cover a range of in-domain and out-of-domain tasks. For video evaluation, the tests include NExTQA, MVBench, Video Detailed Description (VDD), and ActivityNet-QA (Act); the ActivityNet-QA results report both accuracy and GPT scores. The model is also assessed on Video-ChatGPT (VCG) using five criteria: correctness of information, detail orientation, context understanding, temporal understanding, and consistency. For 3D evaluation, the tests include ScanQA and two tasks from 3D-LLM.

    The multi-image results show that the average performance of LLaVA-NeXT-Interleave is better than that of earlier open-source models on both in-domain and out-of-domain tests. After adding DPO, the proposed 7B model achieves top performance on the VDD and Video-ChatGPT tests, outperforming the previous LLaVA-NeXT-Video (34B). For 3D, LLaVA-NeXT-Interleave uses only multi-view images to understand the 3D world, yet it scores much higher than 3D-LLM and Point-LLM in difficult 3D scenarios. For single-image tasks, 307k samples (40%) of the original LLaVA-NeXT single-image data are added to the training data as the Multi-patch (single-image) portion, so the model can still handle these tasks.

    In conclusion, the researchers have introduced LLaVA-NeXT-Interleave, a flexible LMM that can handle real-world settings such as multi-image, multi-frame (video), and multi-view (3D). They emphasize the model's potential to improve and unify the capabilities of LMMs across visual tasks. Extensive experiments in the paper show that LLaVA-NeXT-Interleave sets a new standard on multi-image tasks and performs very well on single-image tasks, opening the door to future advances in multimodal AI and complex visual understanding.

    Check out the Paper and GitHub. All credit for this research goes to the researchers of this project.

    The post LLaVA-NeXT-Interleave: A Versatile Large Multimodal Model LMM that can Handle Settings like Multi-image, Multi-frame, and Multi-view appeared first on MarkTechPost.
