
    Looking for a specific action in a video? This AI-based method can find it for you

    May 29, 2024

    The internet is awash in instructional videos that can teach curious viewers everything from cooking the perfect pancake to performing a life-saving Heimlich maneuver.

    But pinpointing when and where a particular action happens in a long video can be tedious. To streamline the process, scientists are trying to teach computers to perform this task. Ideally, a user could just describe the action they’re looking for, and an AI model would skip to its location in the video.

    However, teaching machine-learning models to do this usually requires a great deal of expensive video data that have been painstakingly hand-labeled.

    A new, more efficient approach from researchers at MIT and the MIT-IBM Watson AI Lab trains a model to perform this task, known as spatio-temporal grounding, using only videos and their automatically generated transcripts.

    The researchers teach a model to understand an unlabeled video in two distinct ways: by looking at small details to figure out where objects are located (spatial information) and looking at the bigger picture to understand when the action occurs (temporal information).

    Compared to other AI approaches, their method more accurately identifies actions in longer videos with multiple activities. Interestingly, they found that simultaneously training on spatial and temporal information makes a model better at identifying each individually.

    In addition to streamlining online learning and virtual training processes, this technique could also be useful in health care settings by rapidly finding key moments in videos of diagnostic procedures, for example.

    “We disentangle the challenge of trying to encode spatial and temporal information all at once and instead think about it like two experts working on their own, which turns out to be a more explicit way to encode the information. Our model, which combines these two separate branches, leads to the best performance,” says Brian Chen, lead author of a paper on this technique.

    Chen, a 2023 graduate of Columbia University who conducted this research while a visiting student at the MIT-IBM Watson AI Lab, is joined on the paper by James Glass, senior research scientist, member of the MIT-IBM Watson AI Lab, and head of the Spoken Language Systems Group in the Computer Science and Artificial Intelligence Laboratory (CSAIL); Hilde Kuehne, a member of the MIT-IBM Watson AI Lab who is also affiliated with Goethe University Frankfurt; and others at MIT, Goethe University, the MIT-IBM Watson AI Lab, and Quality Match GmbH. The research will be presented at the Conference on Computer Vision and Pattern Recognition.

    Global and local learning

    Researchers usually teach models to perform spatio-temporal grounding using videos in which humans have annotated the start and end times of particular tasks.

    Not only is generating these data expensive, but it can be difficult for humans to figure out exactly what to label. If the action is “cooking a pancake,” does that action start when the chef begins mixing the batter or when she pours it into the pan?

    “This time, the task may be about cooking, but next time, it might be about fixing a car. There are so many different domains for people to annotate. But if we can learn everything without labels, it is a more general solution,” Chen says.

    For their approach, the researchers use unlabeled instructional videos and accompanying text transcripts from a website like YouTube as training data. These don’t need any special preparation.

    They split the training process into two pieces. For one, they teach a machine-learning model to look at the entire video to understand what actions happen at certain times. This high-level information is called a global representation.

    For the second, they teach the model to focus on a specific region in parts of the video where action is happening. In a large kitchen, for instance, the model might only need to focus on the wooden spoon a chef is using to mix pancake batter, rather than the entire counter. This fine-grained information is called a local representation.
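The split into a global (temporal) branch and a local (spatial) branch can be pictured with a short sketch. The module choices, tensor shapes, and the simple fusion step below are illustrative assumptions rather than the exact architecture from the paper; the point is only that one branch scores *when* the action happens across the whole video while the other scores *where* it happens within each frame, and the two are combined.

```python
# Illustrative sketch of a two-branch spatio-temporal grounding model.
# Shapes, module choices, and the fusion step are assumptions for clarity,
# not the exact architecture described in the paper.
import torch
import torch.nn as nn


class DualBranchGrounding(nn.Module):
    def __init__(self, feat_dim: int = 512, text_dim: int = 512):
        super().__init__()
        # Global (temporal) branch: reasons over the whole video to score
        # *when* the described action happens.
        self.temporal_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=feat_dim, nhead=8, batch_first=True),
            num_layers=2,
        )
        self.temporal_proj = nn.Linear(feat_dim, text_dim)
        # Local (spatial) branch: scores patches within each frame to find
        # *where* the relevant objects are.
        self.spatial_proj = nn.Linear(feat_dim, text_dim)

    def forward(self, frame_feats, patch_feats, text_emb):
        # frame_feats: (B, T, D)    one feature vector per frame
        # patch_feats: (B, T, P, D) one feature vector per spatial patch
        # text_emb:    (B, D)       embedding of the transcript sentence / query
        # Temporal branch: contextualize frames over the full video, then
        # score each frame against the text query.
        frames = self.temporal_encoder(frame_feats)                      # (B, T, D)
        temporal_scores = torch.einsum(
            "btd,bd->bt", self.temporal_proj(frames), text_emb
        )                                                                # (B, T)

        # Spatial branch: score each patch in each frame against the text.
        spatial_scores = torch.einsum(
            "btpd,bd->btp", self.spatial_proj(patch_feats), text_emb
        )                                                                # (B, T, P)

        # Simple fusion (an assumption): gate each frame's patch map by how
        # relevant that frame is in time.
        grounding_map = spatial_scores * temporal_scores.unsqueeze(-1)
        return temporal_scores, spatial_scores, grounding_map
```

In a self-supervised setup like the one described, each branch would typically be trained by aligning these scores with the automatically generated transcript rather than with manual labels, which is what lets unlabeled narrated video stand in for hand annotation.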

    The researchers incorporate an additional component into their framework to mitigate misalignments that occur between narration and video. Perhaps the chef talks about cooking the pancake first and performs the action later.

    To develop a more realistic solution, the researchers focused on uncut videos that are several minutes long. In contrast, most AI techniques train using few-second clips that someone trimmed to show only one action.

    A new benchmark

    But when they came to evaluate their approach, the researchers couldn’t find an effective benchmark for testing a model on these longer, uncut videos — so they created one.

    To build their benchmark dataset, the researchers devised a new annotation technique that works well for identifying multistep actions. They had users mark the intersection of objects, like the point where a knife edge cuts a tomato, rather than drawing a box around important objects.

    “This is more clearly defined and speeds up the annotation process, which reduces the human labor and cost,” Chen says.

Plus, having multiple people do point annotation on the same video can better capture actions that occur over time, like the flow of milk being poured, since not every annotator will mark the exact same point in the flow of liquid.
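Scoring a model against point annotations of this kind could look roughly like the sketch below. The distance threshold, the "hit if close to any annotator" rule, and the function name are assumptions for illustration, not the benchmark's actual protocol.

```python
# Hypothetical sketch of evaluating predictions against point annotations.
# The threshold and averaging scheme are assumptions, not the benchmark's
# exact protocol.
import numpy as np


def pointing_accuracy(pred_points, annotated_points, threshold: float = 0.1):
    """pred_points:      dict frame_id -> (x, y) predicted interaction point,
                         in coordinates normalized to [0, 1].
    annotated_points:    dict frame_id -> list of (x, y) points from several
                         annotators for the same frame.
    A prediction counts as a hit if it lands within `threshold` of any
    annotator's point; accuracy is averaged over annotated frames."""
    frames = list(annotated_points)
    hits = 0
    for frame_id in frames:
        pred = np.asarray(pred_points[frame_id], dtype=float)
        targets = np.asarray(annotated_points[frame_id], dtype=float)
        dists = np.linalg.norm(targets - pred, axis=1)
        if (dists <= threshold).any():
            hits += 1
    return hits / max(len(frames), 1)
```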

    When they used this benchmark to test their approach, the researchers found that it was more accurate at pinpointing actions than other AI techniques.

    Their method was also better at focusing on human-object interactions. For instance, if the action is “serving a pancake,” many other approaches might focus only on key objects, like a stack of pancakes sitting on a counter. Instead, their method focuses on the actual moment when the chef flips a pancake onto a plate.

    Next, the researchers plan to enhance their approach so models can automatically detect when text and narration are not aligned, and switch focus from one modality to the other. They also want to extend their framework to audio data, since there are usually strong correlations between actions and the sounds objects make.

    This research is funded, in part, by the MIT-IBM Watson AI Lab.
