
    Make videos accessible with automated audio descriptions using Amazon Nova

    June 13, 2025

    According to the World Health Organization, more than 2.2 billion people globally have vision impairment. To comply with disability legislation, such as the Americans with Disabilities Act (ADA) in the United States, visual media such as television shows and movies must be made accessible to visually impaired people. This often takes the form of audio description tracks that narrate the visual elements of the film or show. According to the International Documentary Association, creating audio descriptions can cost $25 per minute (or more) when using third parties. Building audio descriptions internally can also require significant effort for businesses in the media industry, involving content creators, audio description writers, description narrators, audio engineers, delivery vendors, and more, according to the American Council of the Blind (ACB). This leads to a natural question: can you automate this process with the help of generative AI offerings in Amazon Web Services (AWS)?

    Announced at re:Invent 2024, the Amazon Nova family of foundation models is available through Amazon Bedrock and includes three multimodal foundation models (FMs):

    • Amazon Nova Lite (GA) – A low-cost multimodal model that’s lightning-fast for processing image, video, and text inputs
    • Amazon Nova Pro (GA) – A highly capable multimodal model with a balanced combination of accuracy, speed, and cost for a wide range of tasks
    • Amazon Nova Premier (GA) – Our most capable model for complex tasks and a teacher for model distillation

    In this post, we demonstrate how you can use services like Amazon Nova, Amazon Rekognition, and Amazon Polly to automate the creation of accessible audio descriptions for video content. This approach can significantly reduce the time and cost required to make videos accessible for visually impaired audiences. However, this post doesn’t provide a complete, deployment-ready solution. We share pseudocode snippets and guidance in sequential order, in addition to detailed explanations and links to resources. For a complete script, you can use additional resources, such as Amazon Q Developer, to build a fully functional system. The automated workflow described in the post involves analyzing video content, generating text descriptions, and narrating them using AI voice generation. In summary, while powerful, this requires careful integration and testing to deploy effectively. By the end of this post, you’ll understand the key steps, but some additional work is needed to create a production-ready solution for your specific use case.

    Solution overview

    The following architecture diagram demonstrates the end-to-end workflow of the proposed solution. We describe each component in depth in later sections of this post, but note that you can define all of the logic within a single script. You can then run your script on an Amazon Elastic Compute Cloud (Amazon EC2) instance or on your local computer. For this post, we assume that you will run the script on an Amazon SageMaker notebook.

    End-to-end AWS workflow demonstrating video content analysis using AI services to generate text descriptions and audio narration

    Services used

    The services shown in the architecture diagram include:

    1. Amazon S3 – Amazon Simple Storage Service (Amazon S3) is an object storage service that provides scalable, durable, and highly available storage. In this example, we use Amazon S3 to store the input video files and the output generated by the solution: scene descriptions (text files) and audio descriptions (MP3 files). The script starts by fetching the source video from an S3 bucket (see the short sketch after this list).
    2. Amazon Rekognition – Amazon Rekognition is a computer vision service that can detect and extract video segments or scenes by identifying technical cues such as shot boundaries, black frames, and other visual elements. To yield higher accuracy for the generated video descriptions, you use Amazon Rekognition to segment the source video into smaller chunks before passing it to Amazon Nova. These video segments can be stored in a temporary directory on your compute machine.
    3. Amazon Bedrock – Amazon Bedrock is a managed service that provides access to large, pre-trained AI models such as the Amazon Nova Pro model, which is used in this solution to analyze the content of each video segment and generate detailed scene descriptions. You can store these text descriptions in a text file (for example, video_analysis.txt).
    4. Amazon Polly – Amazon Polly is a text-to-speech service that is used to convert the text descriptions generated by the Amazon Nova Pro model into high-quality audio, made available using an MP3 file.
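
    As a minimal illustration of that first step, the following sketch fetches the source video from Amazon S3 with Boto3. The bucket name and object key are hypothetical placeholders.

    import boto3

    # Download the source video locally so it can later be split with moviepy.
    # "my-video-bucket" and "input/coffee.mp4" are hypothetical placeholders.
    s3 = boto3.client("s3")
    s3.download_file("my-video-bucket", "input/coffee.mp4", "coffee.mp4")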

    Prerequisites

    To follow along with the solution outlined in this post, you should have the following in place:

    • A video file. For this post, we use a public domain video, This is Coffee.
    • An AWS account with access to the following services:
      • Amazon Rekognition
      • Amazon Nova Pro
      • Amazon S3
      • Amazon Polly
    • Configure your AWS Command Line Interface (AWS CLI) or environment with valid credentials (using aws configure or environment variables)
    • To write the script, you need access to an AWS Software Development Kit (AWS SDK) in the language of your choice. In this post, we assume that you will use the AWS SDK for Python (Boto3). Additional information is available in the Quickstart for Boto3.

    You can use an AWS SDK to create, configure, and manage AWS services. For Boto3, include it at the top of your script using: import boto3

    Additionally, you need a mechanism to split videos. If you’re using Python, we recommend the moviepy library.
    import moviepy # pip install moviepy

    Solution walkthrough

    The solution includes the following steps, which you can use as a basic structure and customize or expand to fit your use case.

    1. Define the requirements for the AWS environment, including the use of the Amazon Nova Pro model (chosen for its vision support) and the AWS Region you’re working in. For optimal throughput, we recommend using inference profiles when configuring Amazon Bedrock to invoke the Amazon Nova Pro model. Initialize a client for Amazon Rekognition, which you use for its segmentation support.
    CLASS VideoAnalyzer:
        FUNCTION initialize():
            Set AWS_REGION to "us-east-1"
            Set MODEL_ID to "amazon.nova-pro-v1:0"
            Set chunk_delay to 20
            Initialize AWS clients (Bedrock and Rekognition)
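
    A minimal Python sketch of that initialization might look like the following. The class and attribute names mirror the pseudocode; the Region and model ID are assumptions you should adjust for your account, ideally substituting an inference profile ID for throughput.

    import boto3

    class VideoAnalyzer:
        def __init__(self):
            # Assumed Region and model ID; swap in your own Region or an inference profile ID.
            self.aws_region = "us-east-1"
            self.model_id = "amazon.nova-pro-v1:0"
            self.chunk_delay = 20  # seconds to pause between chunks to help avoid throttling
            self.bedrock = boto3.client("bedrock-runtime", region_name=self.aws_region)
            self.rekognition = boto3.client("rekognition", region_name=self.aws_region)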
    2. Define a function for detecting segments in the video. Amazon Rekognition supports segmentation, which means you can detect and extract different segments or scenes within a video. By using the Amazon Rekognition Segment API, you can perform the following:
      1. Detect technical cues such as black frames, color bars, opening and end credits, and studio logos in a video.
      2. Detect shot boundaries to identify the start, end, and duration of individual shots within the video.

    The solution uses Amazon Rekognition to partition the video into multiple segments and perform Amazon Nova Pro-based inference on each segment. Finally, you can piece together each segment’s inference output to return a comprehensive audio description for the entire video.

    FUNCTION get_segment_results(job_id):
        TRY:
            Initialize empty segments list
            WHILE more results exist:
                Get segment detection results
                Add segments to list
                IF no more results THEN break
            RETURN segments
        CATCH any errors and return null

    FUNCTION extract_scene(video_path, start_time, end_time):
        TRY:
            Load video file
            Validate time range
            Create temporary directory
            Extract video segment
            Save segment to file
            RETURN path to saved segment
        CATCH any errors and return null
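
    The following Python sketch is one way to implement these two helpers with Boto3 and moviepy. The function names mirror the pseudocode; the import shown is for moviepy 1.x (in moviepy 2.x, use "from moviepy import VideoFileClip" and subclipped instead of subclip).

    import os
    import tempfile

    from moviepy.editor import VideoFileClip  # moviepy 1.x import style

    def get_segment_results(rekognition, job_id):
        # Collect all shot segments from a completed Rekognition segment detection job.
        try:
            segments, next_token = [], None
            while True:
                kwargs = {"JobId": job_id}
                if next_token:
                    kwargs["NextToken"] = next_token
                response = rekognition.get_segment_detection(**kwargs)
                segments.extend(response.get("Segments", []))
                next_token = response.get("NextToken")
                if not next_token:
                    break
            return segments
        except Exception as error:
            print(f"Failed to fetch segments: {error}")
            return None

    def extract_scene(video_path, start_time, end_time):
        # Cut one scene out of the source video and return the path to the temporary clip.
        try:
            clip = VideoFileClip(video_path)
            if start_time >= end_time or end_time > clip.duration:
                return None
            scene_path = os.path.join(tempfile.mkdtemp(), f"scene_{start_time:.3f}.mp4")
            clip.subclip(start_time, end_time).write_videofile(
                scene_path, codec="libx264", audio_codec="aac", logger=None
            )
            return scene_path
        except Exception as error:
            print(f"Failed to extract scene: {error}")
            return None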
    

    Three coffee cups on checkered tablecloth and close-up of coffee grounds in cup

    In the preceding image, there are two scenes: a screenshot of one scene on the left followed by the scene that immediately follows it on the right. With the Amazon Rekognition segmentation API, you can identify that the scene has changed—that the content that is displayed on screen is different—and therefore you need to generate a new scene description.

    3. Create the segmentation job:
      • Upload the video file for which you want to create an audio description to Amazon S3.
      • Start the job using that video.

    Setting SegmentTypes=['SHOT'] identifies the start, end, and duration of each shot. Additionally, MinSegmentConfidence sets the minimum confidence Amazon Rekognition must have to return a detected segment, with 0 being the lowest confidence and 100 being the highest.
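
    A brief sketch of starting that job with Boto3 follows; the confidence threshold of 80 is an assumption you can tune.

    def start_shot_detection(rekognition, bucket, key, min_confidence=80.0):
        # Start an asynchronous shot-detection job on a video stored in S3 and return its job ID.
        response = rekognition.start_segment_detection(
            Video={"S3Object": {"Bucket": bucket, "Name": key}},
            SegmentTypes=["SHOT"],
            Filters={"ShotFilter": {"MinSegmentConfidence": min_confidence}},
        )
        return response["JobId"]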

    4. Use the analyze_chunk function. This function defines the main logic of the audio description solution. Some items to note about analyze_chunk:
      • For this example, we sent a video scene to Amazon Nova Pro for an analysis of the contents using the prompt Describe what is happening in this video in detail. This prompt is relatively straightforward and experimentation or customization for your use case is encouraged. Amazon Nova Pro then returned the text description for our video scene.
      • For longer videos with many scenes, you might encounter throttling. This is resolved by implementing a retry mechanism. For details on throttling and quotas for Amazon Bedrock, see Quotas for Amazon Bedrock.
    FUNCTION analyze_chunk(chunk_path):
        TRY:
            Convert video chunk to base64
            Create request body for Bedrock
            Set max_retries and backoff_time

            WHILE retry_count < max_retries:
                TRY:
                    Send InvokeModel request to Bedrock
                    RETURN analysis results
                CATCH throttling:
                    Wait and retry with exponential backoff
                CATCH other errors:
                    RETURN null
        CATCH any errors:
            RETURN null
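
    The following is a minimal sketch of analyze_chunk using the Bedrock InvokeModel API. The request body follows the Amazon Nova messages schema and the response parsing assumes the Nova output format; verify both against the current Amazon Nova user guide before relying on them.

    import base64
    import json
    import time

    from botocore.exceptions import ClientError

    PROMPT = "Describe what is happening in this video in detail."

    def analyze_chunk(bedrock, model_id, chunk_path, max_retries=5, backoff_time=30):
        # Send one video scene to Amazon Nova Pro and return its content blocks (text description).
        try:
            with open(chunk_path, "rb") as f:
                video_b64 = base64.b64encode(f.read()).decode("utf-8")

            body = json.dumps({
                "schemaVersion": "messages-v1",  # assumed Nova request schema
                "messages": [{
                    "role": "user",
                    "content": [
                        {"video": {"format": "mp4", "source": {"bytes": video_b64}}},
                        {"text": PROMPT},
                    ],
                }],
                "inferenceConfig": {"maxTokens": 1000},
            })

            for attempt in range(max_retries):
                try:
                    response = bedrock.invoke_model(modelId=model_id, body=body)
                    result = json.loads(response["body"].read())
                    return result["output"]["message"]["content"]
                except ClientError as error:
                    if error.response["Error"]["Code"] == "ThrottlingException":
                        time.sleep(backoff_time * (2 ** attempt))  # exponential backoff
                    else:
                        return None
            return None
        except Exception:
            return None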

    In effect, the raw scenes are converted into rich, descriptive text. Using this text, you can generate a complete scene-by-scene walkthrough of the video and send it to Amazon Polly for audio.

    5. Use the following code to orchestrate the process:
      1. Initiate the detection of the various segments by using Amazon Rekognition.
      2. Each segment is processed through a flow of:
        1. Extraction.
        2. Analysis using Amazon Nova Pro.
        3. Compiling the analysis into a video_analysis.txt file.
    6. The analyze_video function brings together all of the components and produces a text file that contains the complete, scene-by-scene analysis of the video contents, with timestamps.
    FUNCTION analyze_video(video_path, bucket): 
         TRY: 
             Start segment detection 
             Wait for job completion 
             Get segments 
             FOR each segment: 
                 Extract scene 
                 Analyze chunk 
                 Save analysis results 
             Write results to file 
          CATCH any errors
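
    A sketch of analyze_video, reusing the helpers from the earlier sketches (start_shot_detection, get_segment_results, extract_scene, analyze_chunk) and the VideoAnalyzer client holder, might look like the following; production code should prefer Amazon SNS job notifications over polling.

    import time

    def analyze_video(analyzer, video_path, bucket, key, output_file="video_analysis.txt"):
        # Detect shots, describe each one with Amazon Nova Pro, and write a timestamped analysis file.
        job_id = start_shot_detection(analyzer.rekognition, bucket, key)

        # Poll until the Rekognition job finishes (simplified for the sketch).
        while True:
            status = analyzer.rekognition.get_segment_detection(JobId=job_id)["JobStatus"]
            if status in ("SUCCEEDED", "FAILED"):
                break
            time.sleep(10)
        if status == "FAILED":
            raise RuntimeError("Segment detection job failed")

        segments = get_segment_results(analyzer.rekognition, job_id) or []
        with open(output_file, "w") as out:
            for segment in segments:
                start = segment["StartTimestampMillis"] / 1000.0
                end = segment["EndTimestampMillis"] / 1000.0
                scene_path = extract_scene(video_path, start, end)
                if not scene_path:
                    continue
                analysis = analyze_chunk(analyzer.bedrock, analyzer.model_id, scene_path)
                if analysis:
                    out.write(f"Segment {start}-{end} seconds:\n{analysis}\n\n")
                time.sleep(analyzer.chunk_delay)  # simple pacing to reduce throttling
        return output_file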
    

    If you refer back to the previous screenshot, the output—without any additional refinement—will look similar to the following image.

    Three coffee cups on checkered tablecloth and close-up of coffee grounds in cup

    “Segment 103.136-126.026 seconds:
    [{'text': 'The video shows a close-up of a coffee cup with steam rising from it, followed by three cups of coffee on a table with milk and sugar jars. A person then picks up a bunch of coffee beans from a plant.'}]
    Segment 126.059-133.566 seconds:
    [{'text': "The video starts with a person's hand, covered in dirt and holding a branch with green leaves and berries. The person then picks up some berries. The video then shows a man standing in a field with trees and plants. He is holding a bunch of red fruits in his right hand and looking at them. He is wearing a shirt and has a mustache. He seems to be picking the fruits. The fruits are probably coffee beans. The area is surrounded by green plants and trees."}]” 

    The following screenshot shows a more extensive look at the video_analysis.txt file for the coffee.mp4 video:

    Detailed video analysis text file displaying 12 chronological segments with timestamps, describing a day's journey from waking up to coffee cultivation and brewing.

    7. Send the contents of the text file to Amazon Polly. Amazon Polly adds a voice to the text file, completing the workflow of the audio description solution.
    FUNCTION generate_audio(text_file, output_audio_file):
         TRY:
            Read analysis text
            Set max_retries and backoff_time
    
            WHILE retry_count < max_retries:
               TRY:
                  Initialize Polly client
                  Convert text to speech
                  Save audio file
                  RETURN success
               CATCH throttling:
                  Wait with exponential backoff
                  retry_count += 1
               CATCH other errors:
                  retry_count += 1
                  Continue or Break based on error type
         CATCH any errors:
             RETURN error
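
    A minimal sketch of generate_audio with Boto3 follows. The voice ID is an assumption, and note that synthesize_speech has a per-request character limit; for long analyses you may need to split the text or use the asynchronous start_speech_synthesis_task operation, which writes the audio to S3.

    import time

    import boto3
    from botocore.exceptions import ClientError

    def generate_audio(text_file, output_audio_file, voice_id="Joanna", max_retries=3):
        # Convert the scene-description text into an MP3 narration with Amazon Polly.
        polly = boto3.client("polly")
        with open(text_file) as f:
            text = f.read()

        for attempt in range(max_retries):
            try:
                response = polly.synthesize_speech(
                    Text=text,
                    OutputFormat="mp3",
                    VoiceId=voice_id,  # assumed voice; see the Polly voice list for alternatives
                )
                with open(output_audio_file, "wb") as out:
                    out.write(response["AudioStream"].read())
                return True
            except ClientError as error:
                if error.response["Error"]["Code"] == "ThrottlingException":
                    time.sleep(2 ** attempt)  # exponential backoff before retrying
                else:
                    return False
        return False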

    For a list of different voices that you can use in Amazon Polly, see Available voices in the Amazon Polly Developer Guide.

    Your final output from Amazon Polly is an MP3 file that narrates the scene descriptions.

    Clean up

    It’s a best practice to delete the resources you provisioned for this solution. If you used an Amazon EC2 instance or a SageMaker notebook instance, stop or terminate it. Remember to delete unused files from your S3 bucket (for example, video_analysis.txt and video_analysis.mp3).

    Conclusion

    Recapping the solution at a high level, in this post, you used:

    • Amazon S3 to store the original video, intermediate data, and the final audio description artifacts
    • Amazon Rekognition to partition the video file into time-stamped scenes
    • Computer vision capabilities from Amazon Nova Pro (available through Amazon Bedrock) to analyze the contents of each scene

    We showed you how to use Amazon Polly to create an MP3 audio file from the final scene description text file, which is what will be consumed by the audience members. The solution outlined in this post demonstrates how to fully automate the process of creating audio descriptions for video content to improve accessibility. By using Amazon Rekognition for video segmentation, the Amazon Nova Pro model for scene analysis, and Amazon Polly for text-to-speech, you can generate a comprehensive audio description track that narrates the key visual elements of a video. This end-to-end automation can significantly reduce the time and cost required to make video content accessible for visually impaired audiences, helping businesses and organizations meet their accessibility goals. With the power of AWS AI services, this solution provides a scalable and efficient way to improve accessibility and inclusion for video-based media.

    This solution isn’t limited to TV shows and movies. Any visual media that requires accessibility can be a candidate! For more information about the new Amazon Nova model family and the amazing things these models can do, see Introducing Amazon Nova foundation models: Frontier intelligence and industry leading price performance.

    Beyond the steps described in this post, additional actions you might need to take include:

    • Removing a video segment analysis’s introductory text from Amazon Nova. When Amazon Nova returns a response, it might begin with something like “In this video…”. You probably want just the video description itself, without this introductory text. If introductory text remains in your scene descriptions, Amazon Polly will speak it aloud and reduce the quality of your audio descriptions. You can account for this in a few ways.
      • For example, prior to sending it to Amazon Polly, you can modify the generated scene descriptions by programmatically removing that type of text from them.
      • Alternatively, you can use prompt engineering to request that Amazon Bedrock return only the scene descriptions in a structured format or without any additional commentary.
      • The third option is to define and use a tool when performing inference on Amazon Bedrock. This can be a more comprehensive technique for defining the format of the output that you want Amazon Bedrock to return. Using tools to shape model output is known as function calling. For more information, see Use a tool to complete an Amazon Bedrock model response. A brief sketch of this approach follows this list.
    • You should also consider the architectural components of the solution. In a production environment, account for potential scaling, security, and storage requirements, because the architecture might grow to resemble something more complex than the basic solution architecture diagram that this post began with.
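
    The following is a hedged sketch of the tool-based approach, assuming Amazon Nova Pro accepts a video block and a tool configuration in the same Converse request; the tool name, schema, and file name are hypothetical.

    import boto3

    bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

    # Hypothetical tool that asks the model to return only the bare scene description.
    tool_config = {
        "tools": [{
            "toolSpec": {
                "name": "record_scene_description",
                "description": "Record the scene description with no introductory text.",
                "inputSchema": {"json": {
                    "type": "object",
                    "properties": {"description": {"type": "string"}},
                    "required": ["description"],
                }},
            }
        }]
    }

    with open("scene_0.mp4", "rb") as f:  # hypothetical scene file from the extraction step
        video_bytes = f.read()

    response = bedrock.converse(
        modelId="amazon.nova-pro-v1:0",
        messages=[{
            "role": "user",
            "content": [
                {"video": {"format": "mp4", "source": {"bytes": video_bytes}}},
                {"text": "Describe what is happening in this video in detail."},
            ],
        }],
        toolConfig=tool_config,
    )

    # Pull the structured description out of the tool-use block, if the model produced one.
    for block in response["output"]["message"]["content"]:
        if "toolUse" in block:
            print(block["toolUse"]["input"]["description"])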

    About the Authors

    Dylan Martin is an AWS Solutions Architect, working primarily in the generative AI space helping AWS Technical Field teams build AI/ML workloads on AWS. He brings his experience as both a security solutions architect and software engineer. Outside of work he enjoys motorcycling, the French Riviera and studying languages.

    Ankit Patel is an AWS Solutions Developer, part of the Prototyping And Customer Engineering (PACE) team. Ankit helps customers bring their innovative ideas to life through rapid prototyping, using the AWS platform to build, orchestrate, and manage custom applications.
