LLMs, pretrained on extensive textual data, exhibit impressive capabilities in generative and discriminative tasks. Recent interest focuses on employing LLMs for multimodal tasks, pairing them with visual encoders for captioning, question answering, classification, and segmentation. However, prior multimodal models struggle with video inputs because of the context length restriction of LLMs and GPU memory constraints. For instance, LLaMA has a context limit of 2,048 tokens, while models like LLaVA and BLIP-2 consume 256 and 32 tokens per image, respectively, so even a modest number of frames exhausts the available context. This restricts their practicality for longer videos such as movies or TV shows.
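To make the constraint concrete, here is a back-of-envelope calculation (an illustration based on the figures above, not code from any of the papers) of how many frames fit into a 2,048-token context at those per-image token counts:

```python
# Back-of-envelope illustration: how quickly per-frame visual tokens
# exhaust a 2,048-token LLM context window (before any text prompt).
context_limit = 2048          # e.g., LLaMA's context length
tokens_per_image = {"LLaVA": 256, "BLIP-2": 32}

for name, per_frame in tokens_per_image.items():
    max_frames = context_limit // per_frame
    print(f"{name}: at most {max_frames} frames fit in the context window")
# LLaVA: at most 8 frames fit in the context window
# BLIP-2: at most 64 frames fit in the context window
```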
A simple solution like average pooling along the temporal axis, as used in Video-ChatGPT, leads to inferior performance due to the absence of explicit temporal modeling. Another approach, seen in Video-LLaMA, adds a video modeling component, such as an extra video querying transformer (Q-Former), to capture temporal dynamics and obtain a video-level representation. However, this method increases model complexity, adds training parameters, and is unsuitable for online video analysis.
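For illustration, the following minimal PyTorch sketch (with assumed tensor shapes, not Video-ChatGPT's actual code) shows why average pooling along the temporal axis lacks explicit temporal modeling: the frame dimension is simply collapsed, so frame order never enters the representation.

```python
import torch

# Assumed shapes for illustration: T frames, N visual tokens per frame, D dims.
T, N, D = 100, 32, 768
frame_features = torch.randn(T, N, D)  # per-frame visual tokens

# Averaging over the temporal axis yields one "bag of frames" representation;
# any information about frame ordering is discarded.
video_features = frame_features.mean(dim=0)  # shape (N, D)
print(video_features.shape)  # torch.Size([32, 768])
```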
Researchers from the University of Maryland, Meta, and the University of Central Florida propose a Memory-Augmented Large Multimodal Model (MA-LMM) for efficient long-term video modeling. It follows the structure of existing multimodal models, featuring a visual encoder, a querying transformer, and a large language model. Unlike previous methods, MA-LMM adopts an online approach, processing video frames sequentially and storing their features in a long-term memory bank. This strategy significantly reduces GPU memory usage for long video sequences and sidesteps the context length limitations of LLMs, giving MA-LMM an advantage over prior approaches that consume substantial GPU memory and input text tokens.
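A minimal sketch of the online idea, assuming a simple list-backed memory bank and hypothetical feature shapes (this is not the authors' implementation): each frame is encoded once, its features are appended to the bank, and only the bank, rather than all raw frames, is carried forward.

```python
import torch

class LongTermMemoryBank:
    """Illustrative sketch: store per-frame features online so only the
    memory bank, not the entire video, is kept in GPU memory."""

    def __init__(self):
        self.features = []  # list of (N, D) tensors, one per processed frame

    def add(self, frame_feature: torch.Tensor):
        # Detach so earlier frames do not keep the full backward graph alive.
        self.features.append(frame_feature.detach())

    def as_tensor(self) -> torch.Tensor:
        return torch.stack(self.features, dim=0)  # (t, N, D)

# Online loop: each incoming frame is encoded, then associated with history.
memory = LongTermMemoryBank()
for frame_feature in torch.randn(10, 32, 768):  # stand-in for encoder outputs
    memory.add(frame_feature)
    history = memory.as_tensor()  # the current step attends to this history
```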
The MA-LMM architecture comprises three main components: (1) visual feature extraction with a frozen visual encoder, (2) long-term temporal modeling with a trainable querying transformer (Q-Former) that aligns visual and text embeddings, and (3) text decoding with a frozen large language model. Frames are processed sequentially, and each new input is associated with the historical features stored in the long-term memory bank so that discriminative information is retained efficiently. The Q-Former integrates visual and textual information, while a compression technique keeps the memory bank size bounded without losing discriminative features. Finally, the model decodes text from the Q-Former output, addressing context length limitations and reducing GPU memory requirements during training.
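The compression step can be pictured roughly as follows. This is a hedged sketch that assumes a frame-level merge of the most similar adjacent memory entries; the paper's exact algorithm may differ in detail.

```python
import torch
import torch.nn.functional as F

def compress_memory_bank(bank: torch.Tensor, max_len: int) -> torch.Tensor:
    """Illustrative compression sketch (assumed details): while the bank
    exceeds max_len, average the pair of temporally adjacent entries whose
    features are most similar, so the most redundant information is merged
    and the bank stays at a fixed size.

    bank: (t, N, D) tensor of per-frame features stored in the memory bank.
    """
    while bank.size(0) > max_len:
        flat = bank.flatten(1)                                    # (t, N*D)
        sims = F.cosine_similarity(flat[:-1], flat[1:], dim=-1)   # (t-1,)
        i = int(sims.argmax())                    # most redundant adjacent pair
        merged = (bank[i] + bank[i + 1]) / 2      # average the pair
        bank = torch.cat([bank[:i], merged.unsqueeze(0), bank[i + 2:]], dim=0)
    return bank

compressed = compress_memory_bank(torch.randn(20, 32, 768), max_len=10)
print(compressed.shape)  # torch.Size([10, 32, 768])
```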
MA-LMM demonstrates superior performance across various tasks compared to previous state-of-the-art methods, outperforming existing models in long-term video understanding, video question answering, video captioning, and online action prediction. Its design, combining a long-term memory bank with sequential processing, enables efficient handling of long video sequences and delivers strong results even in challenging scenarios. These findings demonstrate the effectiveness and versatility of MA-LMM in multimodal video understanding applications.
To conclude, this research introduces a long-term memory bank that augments existing large multimodal models, yielding MA-LMM, for effectively modeling long video sequences. The approach addresses the context length limitations and GPU memory constraints inherent in LLM-based models by processing video frames sequentially and storing historical features in the memory bank. As the experiments demonstrate, the long-term memory bank integrates easily into existing models and delivers clear advantages across various tasks.
Check out the Paper and GitHub. All credit for this research goes to the researchers of this project.