
    DeepStack: Enhancing Multimodal Models with Layered Visual Token Integration for Superior High-Resolution Performance

    June 12, 2024

    Most large multimodal models (LMMs) integrate vision and language by converting images into visual tokens that are fed as a sequence into a large language model (LLM). While effective for multimodal understanding, this approach significantly increases memory and computation demands, especially for high-resolution images or video. Various techniques, such as spatial grouping and token compression, aim to reduce the number of visual tokens but often sacrifice detailed visual information. Despite these efforts, the fundamental approach remains the same: visual tokens are flattened into a 1D sequence and fed into the LLM, which inherently increases processing overhead.
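
    To make that overhead concrete, here is a back-of-the-envelope sketch, assuming a ViT-style encoder with 14-pixel patches and a short text prompt (the numbers are illustrative, not DeepStack’s exact configuration), of how the visual token count, and with it the roughly quadratic self-attention cost, grows with resolution once everything is flattened into a single 1D sequence:

    def visual_token_count(height: int, width: int, patch: int = 14) -> int:
        # Number of visual tokens produced by patchifying an image.
        return (height // patch) * (width // patch)

    text_tokens = 128                           # a short prompt, for comparison
    baseline = visual_token_count(336, 336) + text_tokens
    for side in (336, 672, 1344):
        total = visual_token_count(side, side) + text_tokens
        # Self-attention work grows roughly with the square of sequence length.
        print(f"{side}x{side}px -> {total - text_tokens} visual tokens, "
              f"~{(total / baseline) ** 2:.0f}x the baseline attention cost")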

    Researchers from Fudan University and Microsoft have developed “DeepStack,” a new architecture for LMMs. Instead of feeding a long sequence of visual tokens into the language model’s first layer, DeepStack distributes these tokens across multiple layers, aligning each group with a corresponding layer. This bottom-to-top approach enhances the model’s ability to process complex visual inputs without increasing computational cost. Applied to the LLaVA-1.5 and LLaVA-Next models, DeepStack shows significant performance gains across various benchmarks, particularly on high-resolution tasks, and can handle more tokens efficiently than traditional methods.
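
    A minimal PyTorch-style sketch of that layer-wise injection is given below, under the simplifying assumption that the global visual tokens occupy the first positions of the sequence; the class, shapes, and layer count are hypothetical simplifications, not the released implementation:

    import torch
    import torch.nn as nn

    class DeepStackLikeLM(nn.Module):
        # Toy transformer stack: groups of extra visual tokens are added to the
        # hidden states of successive early layers instead of being concatenated
        # to the input sequence.
        def __init__(self, d_model=512, n_heads=8, n_layers=8):
            super().__init__()
            self.layers = nn.ModuleList([
                nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
                for _ in range(n_layers)
            ])

        def forward(self, tokens, visual_groups):
            # tokens: (B, L, D) text tokens plus the global visual tokens.
            # visual_groups: list of (B, V, D) tensors, one group per early layer,
            # added onto the positions held by the global visual tokens
            # (assumed here to be the first V positions).
            h = tokens
            for i, layer in enumerate(self.layers):
                if i < len(visual_groups):
                    v = visual_groups[i]
                    pad = h.new_zeros(h.shape)
                    pad[:, : v.shape[1], :] = v
                    h = h + pad              # residual "stacking" of the group
                h = layer(h)
            return h

    # Toy usage: 576 global visual tokens + 32 text tokens, four high-res groups.
    model = DeepStackLikeLM()
    out = model(torch.randn(1, 608, 512),
                [torch.randn(1, 576, 512) for _ in range(4)])
    print(out.shape)  # torch.Size([1, 608, 512]) -- context length is unchanged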

    Recent advancements in LLMs like BERT, T5, and GPT have revolutionized natural language processing (NLP) using transformers and pretraining-then-finetuning strategies. These models excel at a range of tasks, from text generation to question answering. In parallel, vision-language models like CLIP and Flamingo effectively integrate vision and language by aligning them in a shared semantic space. However, handling high-resolution images and complex visual inputs remains challenging due to high computational costs. The new “DeepStack” approach addresses this by distributing visual tokens across multiple layers of the LLM or Vision Transformer (ViT), enhancing performance and reducing overhead.

    DeepStack enhances LMMs with a dual-stream approach that incorporates fine-grained visual details without increasing context length. It divides image processing into a global-view stream, which captures overall information, and a high-resolution stream, which adds detailed image features across LLM layers. The high-resolution tokens are upsampled and dilated, then fed into different LLM layers. This strategy significantly improves the model’s ability to handle complex visual inputs efficiently. Unlike traditional methods that concatenate visual tokens into the input, DeepStack integrates them across layers, maintaining efficiency while enhancing the model’s visual processing capabilities.
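
    The grouping of the high-resolution stream can be pictured as dilated (strided) sampling of the upsampled token grid, so that each group spans the whole image at the global stream’s resolution. The sketch below assumes a 2x upsampling factor and illustrative shapes rather than the paper’s exact settings:

    import torch

    def split_highres_tokens(feat, stride=2):
        # feat: (B, H, W, D) high-resolution visual features. Returns
        # stride*stride groups, each (B, (H//stride)*(W//stride), D), obtained
        # by dilated sampling so every group covers the full image.
        B, H, W, D = feat.shape
        groups = []
        for dy in range(stride):
            for dx in range(stride):
                g = feat[:, dy::stride, dx::stride, :]   # (B, H/s, W/s, D)
                groups.append(g.reshape(B, -1, D))
        return groups

    global_tokens = torch.randn(1, 24 * 24, 256)   # global-view stream: LLM input
    highres_grid = torch.randn(1, 48, 48, 256)     # high-resolution stream
    layer_groups = split_highres_tokens(highres_grid)
    print(len(layer_groups), layer_groups[0].shape)  # 4 torch.Size([1, 576, 256])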

    The experiments on DeepStack demonstrate its efficacy in enhancing multimodal language models by integrating high-resolution visual tokens. Using a two-stage training process, it leverages the CLIP image encoder to mosaic high-resolution image patches into whole-image features. Pre-training uses 558k samples from LAION and other datasets, while fine-tuning incorporates 748k samples, adapting LLaVA’s pipeline. DeepStack consistently outperforms baselines like LLaVA on various VQA and multimodal benchmarks, proving its capability to handle detailed visual information. It excels at text-oriented and zero-shot video QA tasks, confirming that early, strategic insertion of visual tokens into the layers significantly enhances model performance without extra computational cost.
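
    The “mosaic” step can be sketched as cutting the high-resolution image into encoder-sized crops, encoding each crop independently, and tiling the per-crop patch features back into one whole-image feature grid; here encode_patches is a hypothetical stand-in for the CLIP image encoder, and the crop and patch sizes are assumptions:

    import torch

    def encode_patches(crop, patch=14, dim=256):
        # Stand-in encoder: maps (B, 3, S, S) to (B, S//patch, S//patch, dim).
        B, _, S, _ = crop.shape
        return torch.randn(B, S // patch, S // patch, dim)

    def mosaic_features(image, crop_size=336):
        # image: (B, 3, H, W) with H and W divisible by crop_size.
        B, _, H, W = image.shape
        rows = []
        for y in range(0, H, crop_size):
            row = [encode_patches(image[:, :, y:y + crop_size, x:x + crop_size])
                   for x in range(0, W, crop_size)]
            rows.append(torch.cat(row, dim=2))   # stitch crops along width
        return torch.cat(rows, dim=1)            # stitch rows along height

    feats = mosaic_features(torch.randn(1, 3, 672, 672))
    print(feats.shape)  # torch.Size([1, 48, 48, 256]) -> feeds the high-res stream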

    In conclusion, DeepStack introduces an innovative approach to enhancing LMMs by stacking visual tokens across multiple model layers rather than feeding them all into the first layer. This method reduces computational and memory demands while significantly improving performance on high-resolution tasks. By distributing visual tokens across different transformer layers, DeepStack lets these tokens interact more effectively throughout the network, yielding substantial gains over traditional models like LLaVA on various benchmarks. The technique proves particularly advantageous in tasks demanding detailed visual comprehension, paving the way for more efficient and powerful multimodal models.

    Check out the Paper, GitHub, and Project. All credit for this research goes to the researchers of this project.

    The post DeepStack: Enhancing Multimodal Models with Layered Visual Token Integration for Superior High-Resolution Performance appeared first on MarkTechPost.
