Microsoft AI Proposes an Automated Pipeline that Utilizes GPT-4V(ision) to Generate Accurate Audio Description AD for Videos

Audio Description (AD) is a big step towards making video content more accessible. AD provides a spoken narrative of important visual elements in a video that are not conveyed by the original audio track. However, producing accurate AD requires a lot of resources, such as special expertise, equipment, and significant time investment. Automating AD production would therefore make many more videos accessible to individuals with visual impairments. Still, a big challenge in automating AD is generating sentences of the right length to fit the temporal gaps of varying duration between actor dialogue.

Recently, large multimodal models (LMMs) have become popular in artificial intelligence. They integrate various data types, including text, image, audio, and video, to become more reliable and intelligent. For example, GPT-4V is an LMM that extends the large language model GPT-4 with vision capabilities. A method called MM-VID pioneered the use of GPT-4V for AD generation with a two-step process: synthesizing condensed frame captions, then refining the final AD output using GPT-4. Unfortunately, this approach has no explicit process for character recognition.

A team from Microsoft introduced an automated pipeline that utilizes GPT-4V(ision) to generate accurate AD for videos. The method takes a movie clip and its title information as input and uses the multimodal capabilities of GPT-4V to combine visual signals from the video frames with textual context. The length of the AD can be adjusted to fit the speech gap, and the output can be adapted to different kinds of videos, by giving the model AD production guidelines and the desired sentence length in simple, natural language.
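To make the idea of controlling AD length through the prompt concrete, here is a minimal sketch. The function names `build_ad_prompt` and `within_word_limit` and the prompt wording are illustrative assumptions, not the prompt used by the researchers, and the call to the model itself is left out.

```python
def build_ad_prompt(movie_title: str, word_limit: int) -> str:
    """Compose a text instruction to send alongside the sampled video frames.

    The wording is hypothetical; only the idea of stating the movie title
    and the desired sentence length in plain language comes from the article.
    """
    if word_limit <= 0:
        raise ValueError("word_limit must be positive")
    return (
        f"These frames come from the movie '{movie_title}'. "
        f"Write one audio description sentence of about {word_limit} words "
        "that describes the key visual action in plain, natural language."
    )


def within_word_limit(ad_sentence: str, word_limit: int) -> bool:
    """Return True when the sentence has at most word_limit whitespace-separated words."""
    return len(ad_sentence.split()) <= word_limit


prompt = build_ad_prompt("Example Movie", 10)
print(prompt)
print(within_word_limit("She opens the door and steps into the rain.", 10))
```

A check like `within_word_limit` could be applied to the model's reply to confirm that it respects the requested length before the sentence is placed into a speech gap.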

The proposed method is tested on the MAD dataset, which includes a rich collection of over 264,000 audio descriptions from 488 movies. The pipeline uses a simple multiple-person tracker to generate person tracklets that capture all characters appearing in the input movie clip. TransNetV2 is used to detect shot boundaries and split clips that contain multiple shots. After the tracklets are generated, square patches are extracted around each person from the frames. Within these patches, face detection is performed using the YOLOv7 model, and the detected faces are cropped and aligned to a standard size of 112 × 112 pixels.
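The square-patch step can be illustrated with a small geometric sketch. The helper `square_patch` below is a hypothetical name, and the rule it uses (side equal to the longer edge of the person box, shifted to stay inside the frame) is an assumption made for illustration; the article does not specify how the patches are sized.

```python
def square_patch(box, frame_w, frame_h):
    """Return (left, top, right, bottom) of a square patch around a person box.

    box is (x1, y1, x2, y2) in pixels. The side equals the longer edge of
    the box, shrunk if needed so the square fits inside the frame.
    """
    x1, y1, x2, y2 = box
    side = min(max(x2 - x1, y2 - y1), frame_w, frame_h)
    cx, cy = (x1 + x2) / 2, (y1 + y2) / 2
    # Shift the square back inside the frame instead of letting it spill over.
    left = int(min(max(cx - side / 2, 0), frame_w - side))
    top = int(min(max(cy - side / 2, 0), frame_h - side))
    return left, top, left + side, top + side


# A tall person box near the left edge of a 640 x 360 frame.
print(square_patch((100, 50, 200, 350), 640, 360))
```

The resulting patch would then be passed to a face detector, and each detected face resized to the 112 × 112 pixel size mentioned above.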

GPT-4V was instructed to generate all AD at fixed word counts of 6, 10, and 20 words, and the performance at each setting was compared. In the AudioVault dataset, 80% of the AD contains ten words or fewer and 99% contains 20 words or fewer, while 6 words matches the dataset's average word count. The results show that the 10-word prompt achieves the highest ROUGE-L and CIDEr scores among the three fixed word counts. The proposed method outperforms AutoAD-II, establishing a new state-of-the-art performance with CIDEr and ROUGE-L scores of 20.5 (vs. 19.5) and 13.5 (vs. 13.4), respectively.
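For readers unfamiliar with ROUGE-L, the sketch below shows the core idea: it scores a generated sentence against a reference by the length of their longest common subsequence (LCS) of words. This is a simplified sentence-level version with a plain F1 combination and whitespace tokenization, written for illustration; it is not the scoring code used in the paper, and evaluation toolkits may tokenize differently or weight recall more heavily.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for tok_a in a:
        curr = [0]
        for j, tok_b in enumerate(b, start=1):
            if tok_a == tok_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate: str, reference: str) -> float:
    """Simplified ROUGE-L: F1 of LCS-based precision and recall."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)  # share of candidate words in the LCS
    recall = lcs / len(ref)      # share of reference words in the LCS
    return 2 * precision * recall / (precision + recall)


score = rouge_l_f1(
    "a man walks into the room",
    "the man walks slowly into a dark room",
)
print(round(score, 3))
```

Because the LCS rewards words that appear in the same order as in the reference, a generated AD that is much longer or shorter than the human-written one tends to lose precision or recall, which is one reason the requested word count affects the score.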

In conclusion, a team from Microsoft proposed an automated pipeline that utilizes GPT-4V(ision) to generate accurate video AD. The method outperforms prior methods compared in the paper, such as AutoAD-II, with CIDEr and ROUGE-L scores of 20.5 (vs. 19.5) and 13.5 (vs. 13.4), respectively. However, it lacks a mechanism to determine suitable moments within a film to insert AD and to estimate the appropriate word count for each AD. There is also room to improve the quality of the generated AD in the future; for example, one could customize a lightweight language-rewriting model using available AD data to enhance the output from the LLM.

Check out the Paper. All credit for this research goes to the researchers of this project.

The post Microsoft AI Proposes an Automated Pipeline that Utilizes GPT-4V(ision) to Generate Accurate Audio Description AD for Videos appeared first on MarkTechPost.
