The world of artificial intelligence has been abuzz with the remarkable achievements of Large Language Models (LLMs) like GPT, PaLM, and LLaMA. These models have demonstrated impressive natural language understanding and generation, signaling a promising step toward artificial general intelligence. However, while LLMs excel at processing text, extending their capabilities to videos, which carry rich temporal information, has remained a significant challenge.
Existing approaches to enabling video understanding in LLMs have notable limitations. Some methods rely on average pooling of video frames, which fails to capture dynamic temporal sequences effectively. Others add extra structures for temporal sampling and modeling, but these solutions demand extensive computational resources and often require multi-stage pretraining.
To tackle this challenge, a team of researchers from Peking University and Tencent has proposed a novel approach called ST-LLM. The core idea is simple yet unexplored: leverage the robust sequence modeling capabilities inherent in LLMs to process raw spatial-temporal video tokens directly.
ST-LLM feeds all video frames into the LLM, as shown in Figures 2 and 3 of the paper, allowing it to model spatial-temporal sequences effectively. To address the potential issue of increased context length for long videos, the researchers introduce a dynamic video token masking strategy together with masked video modeling during training. This approach not only reduces the sequence length but also enhances the model’s robustness to varying video lengths during inference.
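To make the idea concrete, here is a minimal sketch of what dynamic video token masking could look like in PyTorch. The function name, keep ratio, token layout, and the notion of supervising masked positions with a masked-video-modeling loss are illustrative assumptions based on the description above, not the paper's exact implementation.

```python
# Minimal sketch of dynamic video token masking (illustrative assumptions, not
# the paper's exact recipe). Video tokens are assumed to be flattened
# spatial-temporal features of shape [batch, frames * patches, dim].
import torch


def dynamic_token_masking(video_tokens: torch.Tensor, keep_ratio: float = 0.5):
    """Randomly keep a subset of video tokens per sample.

    Returns the kept tokens (a shorter sequence for the LLM) plus the indices
    of kept and masked positions; the masked positions are the ones a
    masked-video-modeling objective could supervise during training.
    """
    b, n, d = video_tokens.shape
    n_keep = max(1, int(n * keep_ratio))

    # Independent random permutation of token positions for each sample.
    noise = torch.rand(b, n, device=video_tokens.device)
    ids_shuffle = noise.argsort(dim=1)
    ids_keep, ids_mask = ids_shuffle[:, :n_keep], ids_shuffle[:, n_keep:]

    kept = torch.gather(video_tokens, 1, ids_keep.unsqueeze(-1).expand(-1, -1, d))
    return kept, ids_keep, ids_mask


# Example: 16 frames x 32 patches = 512 tokens, reduced to 256 before the LLM.
tokens = torch.randn(2, 16 * 32, 4096)
kept, ids_keep, ids_mask = dynamic_token_masking(tokens, keep_ratio=0.5)
print(kept.shape)  # torch.Size([2, 256, 4096])
```

Because the masking ratio can be varied during training, the LLM sees spatial-temporal sequences of different lengths, which is what makes the model more robust to varying video lengths at inference time.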
For particularly long videos, ST-LLM employs a unique global-local input mechanism. It combines the average pooling of a large number of frames (global representation) with a smaller subset of frames (local representation). This asymmetric design enables processing a large number of video frames while preserving the modeling of video tokens within the LLM.
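As a rough illustration of this global-local design, the sketch below shows how pooled global tokens and a small set of local frame tokens could be combined before being passed to the LLM. The tensor shapes, the uniform sampling of local frames, and the concatenation order are assumptions for illustration and may not match the paper's exact configuration.

```python
# Minimal sketch of a global-local input for long videos: average-pool features
# from many frames into one "global" token per spatial position, and concatenate
# them with tokens from a small "local" subset of frames (assumed layout).
import torch


def global_local_input(frame_tokens: torch.Tensor, num_local_frames: int = 4):
    """frame_tokens: [batch, num_frames, patches, dim] from a visual encoder."""
    b, t, p, d = frame_tokens.shape

    # Global branch: average over all frames -> one token per spatial position.
    global_tokens = frame_tokens.mean(dim=1)                      # [b, p, d]

    # Local branch: a small, uniformly sampled subset of frames kept as-is.
    idx = torch.linspace(0, t - 1, num_local_frames).long()
    local_tokens = frame_tokens[:, idx].reshape(b, num_local_frames * p, d)

    # The concatenated sequence is what would be fed to the LLM.
    return torch.cat([global_tokens, local_tokens], dim=1)        # [b, (1 + num_local) * p, d]


# Example: 64 encoded frames with 32 patches each become (1 + 4) * 32 = 160 tokens.
feats = torch.randn(1, 64, 32, 4096)
print(global_local_input(feats).shape)  # torch.Size([1, 160, 4096])
```

The asymmetry is the point: the global branch summarizes many frames cheaply, while the local branch preserves unpooled video tokens so the LLM can still model fine-grained temporal dynamics.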
Extensive experiments on various video benchmarks, including MVBench, VideoChatGPT-Bench, and zero-shot video QA, have demonstrated the remarkable effectiveness of ST-LLM. Qualitatively, the model exhibits superior temporal understanding compared to other video LLMs, accurately capturing even complex motion and scene transitions. Quantitatively, ST-LLM achieves state-of-the-art performance, particularly excelling in metrics related to temporal-sensitive motion.
While ST-LLM struggles with fine-grained tasks like pose estimation, its ability to leverage the LLM’s sequence modeling capabilities without introducing additional modules or expensive pretraining is a significant advantage. The researchers have successfully harnessed the power of LLMs for video understanding, opening up new possibilities in this domain.
Check out the Paper and GitHub. All credit for this research goes to the researchers of this project.