    InternVideo2.5: Hierarchical Token Compression and Task Preference Optimization for Video MLLMs

    January 29, 2025

Multimodal large language models (MLLMs) have emerged as a promising approach toward artificial general intelligence, integrating diverse sensing signals into a unified framework. However, MLLMs face substantial challenges in fundamental vision-related tasks, significantly underperforming humans. Critical limitations persist in object recognition, localization, and motion recall, presenting obstacles to comprehensive visual understanding. Despite ongoing research and scaling efforts, a clear pathway to human-level visual comprehension remains elusive, underscoring how difficult it is to build adaptive multimodal systems that can interpret and reason across different sensory inputs with human-like precision and flexibility.

Existing research on MLLMs has pursued multiple approaches to visual understanding. Current methods combine vision encoders, language models, and connectors through instruction tuning, enabling complex tasks such as image description and visual question answering. Researchers have explored various dimensions, including model architecture, model size, training corpora, and performance optimization. Video-capable MLLMs can process sequential visuals and comprehend spatiotemporal variations. However, existing methods struggle with fine-grained visual tasks such as precise segmentation and temporal grounding. Two strategies have emerged to tackle these challenges: the pixel-to-sequence (P2S) methodology, in which the language model verbalizes dense predictions directly as text tokens, and the pixel-to-embedding (P2E) approach, in which the model emits embeddings that dedicated task heads decode into structured outputs such as masks or timestamps; the sketch below contrasts the two.
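To make the distinction concrete, here is a minimal, hypothetical illustration of the two output styles for a temporal grounding query. The special token, head, and dimensions are illustrative assumptions, not details taken from the paper:

```python
import torch

# Pixel-to-sequence (P2S): the LLM verbalizes the dense prediction
# directly as text tokens, so no extra decoder is required.
p2s_answer = "The event occurs between 12.4s and 18.9s."

# Pixel-to-embedding (P2E): the LLM emits a special token (here a
# hypothetical <TASK> token) whose hidden state is routed to a task head.
hidden_state = torch.randn(1, 1024)            # hidden state of <TASK>
temporal_head = torch.nn.Linear(1024, 2)       # illustrative head -> (start, end)
start_end = torch.sigmoid(temporal_head(hidden_state))  # normalized timestamps
print(start_end)                               # e.g. tensor([[0.31, 0.47]])
```

P2S keeps everything inside the text interface at some cost in precision, while P2E trades interface uniformity for accurate, structured outputs.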

Researchers from Shanghai AI Laboratory, Nanjing University, and Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences have proposed InternVideo2.5, a video MLLM that improves perception through long and rich context (LRC) modeling. It addresses limitations in perceiving fine-grained video details and capturing complex temporal structures. The method focuses on integrating dense vision-task annotations into MLLMs via direct preference optimization and on developing compact spatiotemporal representations through adaptive hierarchical token compression. The researchers aim to expand the model’s capabilities in video understanding, enabling more robust performance across various benchmarks.
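The preference-based annotation transfer builds on the standard direct preference optimization (DPO) objective. As a reference point, here is a minimal sketch of that generic loss in PyTorch; how InternVideo2.5 constructs preference pairs from dense visual annotations is specific to the paper and not reproduced here:

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Standard DPO loss over a batch of preference pairs.

    Each argument is a (batch,) tensor of summed log-probabilities of the
    preferred ("chosen") or dispreferred ("rejected") response under the
    trained policy or the frozen reference model.
    """
    policy_logratio = policy_chosen_logps - policy_rejected_logps
    ref_logratio = ref_chosen_logps - ref_rejected_logps
    # Reward the policy for widening its chosen-vs-rejected margin
    # relative to the reference model.
    return -F.logsigmoid(beta * (policy_logratio - ref_logratio)).mean()
```

In a preference-transfer setting, the "chosen" response would be the one consistent with the dense annotation (for example, the correct temporal span), and the "rejected" response a plausible but incorrect alternative.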

The proposed architecture is a multimodal framework that integrates video processing with language modeling. The system uses dynamic video sampling, processing between 64 and 512 frames; each 8-frame clip is compressed to 128 tokens, yielding a representation of 16 tokens per frame. Key components include a temporal head based on the CG-DETR architecture and a mask head built on SAM2’s pre-trained weights. For temporal processing, the framework uses InternVideo2 for video feature extraction, with query features processed through the language model. Two-layer MLPs encode positional prompts and spatial inputs for the multimodal language model, optimizing its spatiotemporal capabilities.
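The clip-level token budget is easy to verify with a small sketch. The pooling operator below is an assumption chosen for brevity; InternVideo2.5’s actual hierarchical compression is adaptive and learned, but the frame and token counts follow the numbers above:

```python
import torch
import torch.nn.functional as F

FRAMES_PER_CLIP = 8
TOKENS_PER_CLIP = 128   # 128 tokens / 8 frames = 16 tokens per frame

def compress_clip(clip_tokens: torch.Tensor) -> torch.Tensor:
    """Compress one 8-frame clip of patch embeddings to 128 tokens.

    clip_tokens: (frames=8, patches, dim). We flatten frames and patches,
    then average-pool along the token axis to the fixed 128-token budget
    (a stand-in for the model's learned, adaptive compression).
    """
    frames, patches, dim = clip_tokens.shape
    flat = clip_tokens.reshape(frames * patches, dim)      # (N, dim)
    pooled = F.adaptive_avg_pool1d(flat.t().unsqueeze(0),  # (1, dim, N)
                                   TOKENS_PER_CLIP)        # (1, dim, 128)
    return pooled.squeeze(0).t()                           # (128, dim)

# A 256-frame video within the 64-512 sampling range: 32 clips of 8
# frames, each reduced to 128 tokens, so the LLM sees 4,096 video tokens.
video = torch.randn(256, 196, 1024)     # (frames, patches, dim) from a ViT
clips = video.split(FRAMES_PER_CLIP)    # 32 clips, each (8, 196, 1024)
compressed = torch.cat([compress_clip(c) for c in clips])
print(compressed.shape)                 # torch.Size([4096, 1024])
```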

InternVideo2.5 demonstrates strong performance across video understanding benchmarks for both short and long video question answering. Compared to its base model, InternVL2.5, the approach shows significant gains, including increases of over 3 points on MVBench and the Perception Test for short-video predictions. InternVideo2.5 outperforms models such as GPT-4o and Gemini-1.5-Pro in short-duration spatiotemporal understanding. The Needle-In-The-Haystack (NIAH) evaluation further validates the model’s enhanced implicit memory, demonstrating superior recall on a demanding 5,000-frame single-hop task.

In conclusion, the researchers introduced InternVideo2.5, a video MLLM designed to enhance perception and understanding through long and rich context (LRC) modeling. The method uses direct preference optimization to transfer dense visual annotations and adaptive hierarchical token compression for efficient spatiotemporal representation. The work shows significant improvements in visual capabilities, including object tracking, and underscores the importance of multimodal context resolution in advancing MLLM performance. However, the study notes limitations, such as high computational cost and the need for further work on extending context-processing techniques, leaving open questions for future research in multimodal AI.


Check out the Paper and GitHub. All credit for this research goes to the researchers of this project.

    The post InternVideo2.5: Hierarchical Token Compression and Task Preference Optimization for Video MLLMs appeared first on MarkTechPost.
