
    Insight-V: Empowering Multi-Modal Models with Scalable Long-Chain Reasoning

    November 25, 2024

Enabling multimodal large language models (MLLMs) to carry out complex long-chain reasoning over both text and images remains a significant barrier in artificial intelligence. While text-centric reasoning has advanced steadily, multimodal reasoning adds challenges rooted in the scarcity of rich, comprehensive reasoning datasets and of efficient training strategies. As a result, many models reason inaccurately when faced with complex visual data, which limits their use in real-world settings such as autonomous systems, medical diagnosis, and educational tools.

Traditional methods for improving reasoning rely largely on Chain-of-Thought (CoT) prompting or structured datasets, and both have significant drawbacks. Crafting annotated visual reasoning datasets is resource-intensive and demands extensive human labor. Asking a model to reason and summarize in a single step often yields fragmented or incoherent reasoning chains. Moreover, given the scarcity of suitable data and the directness of such training, these systems struggle to generalize across a variety of tasks. These constraints call for new methodologies that strengthen the reasoning capability of multimodal AI systems.
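To make the single-step setup concrete, here is a minimal sketch of plain CoT prompting for a visual question. The `mllm_generate` callable and the prompt wording are hypothetical placeholders for illustration, not Insight-V's actual interface.

```python
# Minimal sketch of single-pass Chain-of-Thought (CoT) prompting for a visual
# question. `mllm_generate` is a hypothetical stand-in for whatever multimodal
# LLM call is available; the prompt wording is illustrative, not from Insight-V.
from typing import Callable

COT_PROMPT = (
    "You are given an image and a question.\n"
    "Question: {question}\n"
    "Think step by step, then state the final answer on the last line."
)

def answer_with_cot(image: bytes, question: str,
                    mllm_generate: Callable[[bytes, str], str]) -> str:
    """One call produces both the reasoning chain and the answer, so any error
    in the chain flows straight into the final answer -- the failure mode the
    two-agent design described below is meant to mitigate."""
    return mllm_generate(image, COT_PROMPT.format(question=question))
```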

Researchers from NTU, Tencent, Tsinghua University, and Nanjing University introduced Insight-V to tackle these challenges through a combination of scalable data generation and a multi-agent framework. It generates diverse, coherent reasoning pathways incrementally and applies multi-granularity evaluation to ensure their quality. A multi-agent system then decomposes the task into two specialized roles: a reasoning agent, which produces detailed logical steps, and a summary agent, which validates and refines those outputs for accuracy. By leveraging iterative Direct Preference Optimization (DPO), a preference-based alignment method, the system is brought closer to human-like judgment. This collaborative architecture enables significant gains in reasoning accuracy and task-specific performance.
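As a rough illustration of this decomposition (the names and call signatures below are assumptions for the sketch, not taken from the Insight-V codebase), the reasoning agent drafts a chain of steps and the summary agent audits it before the answer is returned:

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical sketch of the two-role decomposition: a reasoning agent drafts a
# detailed chain of steps, and a summary agent validates and refines that chain
# into the final answer. Both agents are just callables here.

@dataclass
class TwoAgentPipeline:
    reason: Callable[[bytes, str], str]           # (image, question) -> reasoning chain
    summarize: Callable[[bytes, str, str], str]   # (image, question, chain) -> final answer

    def answer(self, image: bytes, question: str) -> str:
        chain = self.reason(image, question)            # detailed step-by-step draft
        return self.summarize(image, question, chain)   # check the draft, refine, answer
```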

Insight-V is trained on a structured dataset of more than 200K reasoning samples and over 1.2 million summarization examples drawn from sources such as LLaVA-NeXT and other curated data. The reasoning agent produces finalized step-by-step solutions to logical problems, while the summary agent critically evaluates and polishes those steps to reduce errors. Training begins with role-specific supervised fine-tuning and then moves to iterative preference optimization, bringing the outputs closer to human decision-making. This staged training supports robust generalization across domains and complex reasoning tasks.
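For readers unfamiliar with DPO, the sketch below shows a generic per-pair DPO loss computed from summed token log-probabilities under the trained policy and a frozen reference model. This is the standard formulation rather than code from the Insight-V repository; the iterative variant repeats it with refreshed preference pairs.

```python
import math

def dpo_loss(policy_logp_chosen: float, policy_logp_rejected: float,
             ref_logp_chosen: float, ref_logp_rejected: float,
             beta: float = 0.1) -> float:
    """Generic Direct Preference Optimization loss for one preference pair.

    The logp_* values are summed token log-probabilities of the preferred
    (chosen) and dispreferred (rejected) responses under the policy being
    trained and under a frozen reference model (e.g. the supervised
    fine-tuned agent). Minimizing the loss widens the policy's margin for
    the chosen response relative to the reference, scaled by beta."""
    margin = beta * ((policy_logp_chosen - ref_logp_chosen)
                     - (policy_logp_rejected - ref_logp_rejected))
    # -log(sigmoid(margin)) == log(1 + exp(-margin))
    return math.log1p(math.exp(-margin))
```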

On multimodal reasoning benchmarks, the system delivers substantial gains, with a mean relative improvement of 7.0% over LLaVA-NeXT and 2.9% over its baseline model. Insight-V improves performance on tasks such as detailed chart analysis and mathematical reasoning while also generalizing to perception-focused evaluations like TextVQA. This consistent improvement across tasks validates the utility of the system and positions it as a notable development among multimodal reasoning models.
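For reference, a mean relative improvement like the 7.0% figure is typically computed as the average of per-benchmark relative gains. The helper below illustrates that standard definition with made-up scores; the numbers are not from the paper.

```python
def mean_relative_improvement(new_scores: list[float],
                              base_scores: list[float]) -> float:
    """Average of per-benchmark relative gains, in percent:
    mean over benchmarks of (new - base) / base * 100."""
    gains = [(n - b) / b * 100.0 for n, b in zip(new_scores, base_scores)]
    return sum(gains) / len(gains)

# Made-up scores on three benchmarks, purely to show the arithmetic.
print(round(mean_relative_improvement([70.2, 55.1, 63.0], [65.0, 52.0, 60.0]), 1))
```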

Insight-V offers a framework for addressing key challenges in multimodal reasoning by integrating scalable data generation with a collaborative multi-agent architecture. Its main contributions are improved reasoning over structured datasets, task-specific decomposition, and preference-based optimization. The work shows that MLLMs can handle reasoning-intensive tasks effectively while remaining versatile across domains. In that regard, Insight-V serves as a foundation for further work on systems that perform complex reasoning in challenging visual-linguistic environments.


Check out the Paper and GitHub Page. All credit for this research goes to the researchers of this project.


    The post Insight-V: Empowering Multi-Modal Models with Scalable Long-Chain Reasoning appeared first on MarkTechPost.

