LongVA and the Impact of Long Context Transfer in Visual Processing: Enhancing Large Multimodal Models for Long Video Sequences

June 29, 2024

This line of research aims to enable large multimodal models (LMMs) to process and understand extremely long video sequences. Video offers valuable temporal information, but current LMMs struggle with exceptionally long videos. The problem stems from the sheer volume of visual tokens generated by vision encoders, which existing models cannot handle efficiently.

The core problem this research addresses is that current LMMs cannot effectively process and understand long videos, because vision encoders produce an excessive number of visual tokens. For instance, models like LLaVA-1.6 generate between 576 and 2880 visual tokens for a single image, and the total grows with every additional frame. This creates a bottleneck in processing and understanding long video sequences.
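To make the scaling concrete, the sketch below multiplies the per-image token range quoted above by a frame count. It assumes, purely for illustration, that every video frame is encoded like a standalone image; the function name and the frame counts are hypothetical and not taken from the paper.

```python
from __future__ import annotations

# Per-image visual token range quoted above for LLaVA-1.6.
MIN_TOKENS_PER_IMAGE = 576
MAX_TOKENS_PER_IMAGE = 2880


def visual_token_range(num_frames: int) -> tuple[int, int]:
    """Return the (low, high) visual token count for `num_frames` frames.

    Assumption: each frame costs as much as one standalone image.
    """
    if num_frames < 0:
        raise ValueError("num_frames must be non-negative")
    return (num_frames * MIN_TOKENS_PER_IMAGE, num_frames * MAX_TOKENS_PER_IMAGE)


if __name__ == "__main__":
    for frames in (1, 16, 64, 256):  # illustrative frame counts
        low, high = visual_token_range(frames)
        print(f"{frames:>4} frames -> {low:,} to {high:,} visual tokens")
```

Even at the low end of the range, a few hundred frames already cost well over a hundred thousand tokens under this assumption, which is why the context length of the language model becomes the limiting factor.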

Existing approaches include visual resamplers, which reduce the number of visual tokens, and heuristic techniques that prune or merge visual features. Despite these efforts, most LMMs still struggle to process many frames effectively. Methods such as the visual resampler used by MPLUG-Owl-video and MovieChat compress the visual features but fall short when dealing with extensive video data.

Researchers from the LMMs-Lab Team, NTU, and SUTD in Singapore have introduced an approach called Long Context Transfer to address this challenge. It extends the context length of the language model backbone, enabling it to process a significantly larger number of visual tokens. The method is distinctive because it does not require additional video training. Instead, it leverages the extended context length of the language model, allowing LMMs to comprehend orders of magnitude more visual tokens.

The proposed model, Long Video Assistant (LongVA), extends the context length of the language model by training it on longer text data. This context-extended language model is then aligned with visual inputs, allowing the model to process long videos effectively without additional complexity. The UniRes encoding scheme, which unifies the representation of images and videos, plays a crucial role in this process. LongVA can treat videos as extended images during inference, significantly enhancing its ability to process long video sequences.

LongVA's performance on the Video-MME dataset demonstrates its capability to handle long videos. It can process up to 2000 frames or over 200,000 visual tokens, setting a new benchmark in this area. The Visual Needle-In-A-Haystack (V-NIAH) benchmark was developed to measure LMMs' ability to locate and retrieve visual information over long contexts. LongVA showed superior performance in these evaluations, retrieving visual information accurately from up to 3000 frames.
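The idea behind a needle-in-a-haystack test can be sketched as follows: a single "needle" frame is placed at a chosen relative depth inside a long sequence of unrelated "haystack" frames, and the model is then asked a question that only the needle can answer. The code below is a minimal illustration of that construction, not the V-NIAH implementation; frames are stand-in strings, and `insert_needle` and its parameters are hypothetical names.

```python
from __future__ import annotations


def insert_needle(
    haystack: list[str], needle: str, depth: float
) -> tuple[list[str], int]:
    """Place `needle` at relative `depth` (0.0 = start, 1.0 = end) in `haystack`.

    Returns the new frame list and the index at which the needle sits.
    """
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be between 0.0 and 1.0")
    index = round(depth * len(haystack))
    frames = haystack[:index] + [needle] + haystack[index:]
    return frames, index


if __name__ == "__main__":
    # Placeholder frames; a real test would use decoded video frames.
    haystack = [f"frame_{i}" for i in range(3000)]
    for depth in (0.0, 0.5, 1.0):
        frames, index = insert_needle(haystack, "needle_frame", depth)
        print(f"depth={depth:.1f}: needle at index {index} of {len(frames)} frames")
```

Sweeping both the haystack length and the depth gives a grid of test cases, which is the usual way such retrieval benchmarks reveal where a model starts to lose information.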

Experiments showed that LongVA could effectively process and understand long videos, achieving state-of-the-art performance among 7B-scale models. The model was trained on a context length of 224K tokens, equivalent to 1555 frames, and it generalizes well beyond that, maintaining performance within 3000 frames. This demonstrates the effectiveness of the long context transfer phenomenon, where the extended context of the language model enhances the visual processing capabilities of the LMMs.
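The figures above imply a roughly fixed token cost per frame, which the short calculation below makes explicit. It assumes that "224K" means 224,000 tokens and that every frame costs the same number of tokens; the per-frame value is derived from the numbers in this article rather than quoted from the paper.

```python
TRAIN_CONTEXT_TOKENS = 224_000  # assumption: "224K" read as 224,000 tokens
TRAIN_CONTEXT_FRAMES = 1555     # frame count quoted above

# Implied per-frame cost, rounded to the nearest whole token.
tokens_per_frame = round(TRAIN_CONTEXT_TOKENS / TRAIN_CONTEXT_FRAMES)


def tokens_for_frames(num_frames: int) -> int:
    """Visual tokens needed for `num_frames` frames at the implied per-frame cost."""
    return num_frames * tokens_per_frame


if __name__ == "__main__":
    print(f"implied tokens per frame: {tokens_per_frame}")
    for frames in (1555, 2000, 3000):
        print(f"{frames} frames -> {tokens_for_frames(frames):,} visual tokens")
```

Under these assumptions, 2000 frames correspond to 288,000 tokens, which is consistent with the "over 200,000 visual tokens" figure, and 3000 frames require roughly twice the context length used in training.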

The researchers conducted detailed experiments to validate their approach. They used Qwen2-7B-Instruct as the backbone language model and performed continued pretraining with a context length of 224K over 900 million tokens. The training framework was designed to be memory efficient and maintain high GPU occupancy. The long context training was completed in just two days using eight A100 GPUs, showcasing the feasibility of this approach within academic budgets.
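A back-of-the-envelope reading of this training budget is sketched below. It assumes that "two days" means 48 wall-clock hours on all eight GPUs, that the 900 million tokens were each processed once, and that "224K" means 224,000 tokens; these are rough assumptions for illustration, and the actual utilization is not stated here.

```python
NUM_GPUS = 8                 # A100 GPUs, as stated above
TRAIN_HOURS = 48             # assumption: "two days" of wall-clock time
TRAIN_TOKENS = 900_000_000   # continued-pretraining tokens, as stated above
CONTEXT_TOKENS = 224_000     # assumption: "224K" read as 224,000 tokens

gpu_hours = NUM_GPUS * TRAIN_HOURS
tokens_per_gpu_hour = TRAIN_TOKENS / gpu_hours
# Number of whole context-length sequences the token budget could fill.
full_length_sequences = TRAIN_TOKENS // CONTEXT_TOKENS

if __name__ == "__main__":
    print(f"GPU-hours: {gpu_hours}")
    print(f"tokens per GPU-hour: {tokens_per_gpu_hour:,.0f}")
    print(f"full-length sequences: {full_length_sequences:,}")
```

The small number of full-length sequences illustrates why the context extension fits in a modest compute budget: the model sees only a few thousand very long documents rather than a large pretraining corpus.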

In conclusion, this research addresses the critical problem of processing and understanding long video sequences in large multimodal models. By extending the context length of the language model and aligning it with visual inputs, the researchers significantly improved the LMMs' capability to handle long videos. The proposed LongVA model demonstrates substantial performance improvements, processing up to 2000 frames or over 200,000 visual tokens and setting a new standard for LMMs in this field. This work highlights the potential of long context transfer to enhance the capabilities of LMMs for long video processing.

Check out the Paper, Project, and Demo. All credit for this research goes to the researchers of this project.

The post LongVA and the Impact of Long Context Transfer in Visual Processing: Enhancing Large Multimodal Models for Long Video Sequences appeared first on MarkTechPost.
