    Microsoft AI Proposes an Automated Pipeline that Utilizes GPT-4V(ision) to Generate Accurate Audio Description AD for Videos

    May 7, 2024

    Audio Description (AD) is a major step towards making video content more accessible. AD provides a spoken narrative of the important visual elements in a video that are not conveyed by the original audio track. However, producing accurate AD is resource-intensive: it requires specialized expertise, equipment, and a significant time investment. Automating AD production would therefore make far more video accessible to people with visual impairments. A key challenge in automating AD is generating sentences of the right length to fit the temporal gaps of varying duration between lines of actor dialogue.

    Recently, large multimodal models (LMMs) have become popular in artificial intelligence. They integrate several data types, including text, image, audio, and video, with the aim of being more reliable and capable. For example, GPT-4V is an LMM that extends the large language model GPT-4 with vision capabilities. A method called MM-VID pioneered the use of GPT-4V for AD generation with a two-step approach: it first synthesizes condensed frame captions and then refines the final AD output using GPT-4. However, such methods have no explicit process for character recognition.

    A team from Microsoft introduced an automated pipeline that uses GPT-4V(ision) to generate accurate AD for videos. The method takes a movie clip and its title as input and exploits the multimodal capabilities of GPT-4V by combining visual signals from the video frames with textual context. The length of the AD can be adjusted to fit the speech gap, and the output can be adapted to different kinds of videos, by including AD production guidelines and the desired sentence length in the prompt in plain, natural language.
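
    As a rough illustration of stating the length constraint in natural language, the sketch below builds a text instruction from a movie title and a target word count. The prompt wording, the function names, and the speaking rate of 2.5 words per second are all assumptions made for this example; the article does not give the actual prompt, and the gap-to-word-count helper is not part of the described pipeline.

```python
def words_that_fit(gap_seconds: float, words_per_second: float = 2.5) -> int:
    """Estimate how many spoken words fit in a dialogue gap.

    The default speaking rate is an assumed value for illustration only.
    """
    if gap_seconds <= 0:
        return 0
    return int(gap_seconds * words_per_second)


def build_ad_prompt(movie_title: str, word_limit: int) -> str:
    """Compose a hypothetical instruction that states the length in plain words."""
    if word_limit < 1:
        raise ValueError("word_limit must be at least 1")
    return (
        f"You are writing audio description for the movie '{movie_title}'. "
        "Describe the key visual action in the attached frames "
        f"in one sentence of about {word_limit} words."
    )


# A 4-second gap at the assumed rate leaves room for roughly ten words.
prompt = build_ad_prompt("Example Movie", words_that_fit(4.0))
print(prompt)
```

    The text returned here would be sent to the multimodal model together with the sampled frames; that call is omitted because it depends on the API being used.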

    The proposed method is evaluated on the MAD dataset, which contains over 264,000 audio descriptions from 488 movies. To support character recognition, a simple multiple-person tracker generates person tracklets that capture every character appearing in the input movie clip. TransNetV2 is used to detect shot boundaries and split clips that contain multiple shots. Once the tracklets are generated, square patches are extracted around each person in the frames. Within these patches, faces are detected using the YOLOv7 model, then cropped and aligned to a standard size of 112 × 112 pixels.
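
    The square-patch step can be sketched with plain arithmetic. The function below is a minimal illustration, assuming person boxes given as (x1, y1, x2, y2) pixel coordinates; the name `square_patch` and the choice to shift the square back inside the frame rather than pad it are assumptions, since the article does not describe how the patches are computed.

```python
def square_patch(box, frame_w, frame_h):
    """Return (left, top, right, bottom) of a square crop centred on a person box.

    The side equals the longer edge of the box, capped by the frame size.
    """
    x1, y1, x2, y2 = box
    side = int(min(max(x2 - x1, y2 - y1), frame_w, frame_h))
    cx, cy = (x1 + x2) / 2, (y1 + y2) / 2
    left = int(round(cx - side / 2))
    top = int(round(cy - side / 2))
    # Shift the square back inside the frame instead of shrinking it.
    left = max(0, min(left, frame_w - side))
    top = max(0, min(top, frame_h - side))
    return left, top, left + side, top + side


# A tall, narrow person box near the left edge of a 640 x 360 frame.
print(square_patch((100, 50, 200, 350), 640, 360))
```

    Resizing the detected face crop to 112 × 112 pixels would then be done with an image library, which is left out here to keep the sketch dependency-free.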

    GPT-4V was prompted to generate AD at fixed target lengths of 6, 10, and 20 words, and the resulting scores were compared. In the AudioVault dataset, 80% of ADs contain ten words or fewer, 99% contain 20 words or fewer, and a length of 6 words matches the dataset's average word count. Among the three settings, the 10-word prompt achieves the highest ROUGE-L and CIDEr scores. The proposed method outperforms AutoAD-II, establishing a new state of the art with a CIDEr score of 20.5 (vs. 19.5) and a ROUGE-L score of 13.5 (vs. 13.4).
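
    For readers unfamiliar with ROUGE-L, it scores a generated sentence by the longest common subsequence (LCS) of words it shares with a reference. The sketch below is a simplified version, assuming lowercase whitespace tokenization and a balanced F1 score; standard evaluation toolkits may tokenize, stem, and weight precision and recall differently, so it will not necessarily reproduce the figures reported above. The example sentences are invented for illustration.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for tok_a in a:
        curr = [0]
        for j, tok_b in enumerate(b, start=1):
            if tok_a == tok_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate: str, reference: str) -> float:
    """Simplified ROUGE-L: F1 of LCS-based precision and recall."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    if not cand or not ref:
        return 0.0
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


# The shared subsequence is "man opens door" (3 words).
score = rouge_l_f1("a man opens the door", "the man slowly opens a door")
print(round(score, 3))
```

    Because precision is divided by the candidate length, an AD that is much longer than the reference is penalized even when it contains every reference word, which is one reason the target word count in the prompt affects the score.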

    In conclusion, a team from Microsoft proposed an automated pipeline that uses GPT-4V(ision) to generate accurate video AD. The method outperforms the baselines compared in the paper, including AutoAD-II, with CIDEr and ROUGE-L scores of 20.5 (vs. 19.5) and 13.5 (vs. 13.4), respectively. However, it lacks a mechanism for determining suitable moments in a film to insert AD and for estimating the appropriate word count for each one. The quality of the generated AD also leaves room for improvement; for example, a lightweight language-rewriting model could be trained on available AD data to refine the output of the LLM.

    Check out the Paper. All credit for this research goes to the researchers of this project.

    The post Microsoft AI Proposes an Automated Pipeline that Utilizes GPT-4V(ision) to Generate Accurate Audio Description AD for Videos appeared first on MarkTechPost.
