Close Menu
    DevStackTipsDevStackTips
    • Home
    • News & Updates
      1. Tech & Work
      2. View All

      Web Components: Working With Shadow DOM

      July 28, 2025

      Google’s new Opal tool allows users to create mini AI apps with no coding required

      July 28, 2025

      Designing Better UX For Left-Handed People

      July 25, 2025

      This week in AI dev tools: Gemini 2.5 Flash-Lite, GitLab Duo Agent Platform beta, and more (July 25, 2025)

      July 25, 2025

      Microsoft wants you to chat with its browser now – but can you trust this Copilot?

      July 28, 2025

      I tested the Dell XPS’ successor – here are the biggest upgrades (and what’s the same)

      July 28, 2025

      I’m a Linux pro – here are my top 5 command line backup tools for desktops and servers

      July 28, 2025

      Should you buy a refurbished iPad? I tried one from Back Market and here’s my verdict

      July 28, 2025
    • Development
      1. Algorithms & Data Structures
      2. Artificial Intelligence
      3. Back-End Development
      4. Databases
      5. Front-End Development
      6. Libraries & Frameworks
      7. Machine Learning
      8. Security
      9. Software Engineering
      10. Tools & IDEs
      11. Web Design
      12. Web Development
      13. Web Security
      14. Programming Languages
        • PHP
        • JavaScript
      Featured

      elegantweb/sanitizer

      July 28, 2025
      Recent

      elegantweb/sanitizer

      July 28, 2025

      Streamlined String Encryption with Laravel’s Fluent Methods

      July 28, 2025

      Resume PHP

      July 28, 2025
    • Operating Systems
      1. Windows
      2. Linux
      3. macOS
      Featured

      Gamers bypass UK age verification with Death Stranding — no real face or VPN required

      July 28, 2025
      Recent

      Gamers bypass UK age verification with Death Stranding — no real face or VPN required

      July 28, 2025

      New Xbox games launching this week, from July 28 through August 3 — Grounded 2 arrives on Xbox Game Pass

      July 28, 2025

      TikTok’s owner forked Microsoft’s Visual Studio Code and concerns have been raised — reports suggest it’s resource heavy and never stops ‘phoning home’

      July 28, 2025
    • Learning Resources
      • Books
      • Cheatsheets
      • Tutorials & Guides
    Home»Development»Machine Learning»ViSMaP: Unsupervised Summarization of Hour-Long Videos Using Meta-Prompting and Short-Form Datasets

    ViSMaP: Unsupervised Summarization of Hour-Long Videos Using Meta-Prompting and Short-Form Datasets

    April 28, 2025

    Video captioning models are typically trained on datasets consisting of short videos, usually under three minutes in length, paired with corresponding captions. While this enables them to describe basic actions like walking or talking, these models struggle with the complexity of long-form videos, such as vlogs, sports events, and movies that can last over an hour. When applied to such videos, they often generate fragmented descriptions focused on isolated actions rather than capturing the broader storyline. Efforts like MA-LMM and LaViLa have extended video captioning to 10-minute clips using LLMs, but hour-long videos remain a challenge due to a shortage of suitable datasets. Although Ego4D introduced a large dataset of hour-long videos, its first-person perspective limits its broader applicability. Video ReCap addressed this gap by training on hour-long videos with multi-granularity annotations, yet this approach is expensive and prone to annotation inconsistencies. In contrast, annotated short-form video datasets are widely available and more user-friendly.

    Advancements in visual-language models have significantly enhanced the integration of vision and language tasks, with early works such as CLIP and ALIGN laying the foundation. Subsequent models, such as LLaVA and MiniGPT-4, extended these capabilities to images, while others adapted them for video understanding by focusing on temporal sequence modeling and constructing more robust datasets. Despite these developments, the scarcity of large, annotated long-form video datasets remains a significant hindrance to progress. Traditional short-form video tasks, like video question answering, captioning, and grounding, primarily require spatial or temporal understanding, whereas summarizing hour-long videos demands identifying key frames amidst substantial redundancy. While some models, such as LongVA and LLaVA-Video, can perform VQA on long videos, they struggle with summarization tasks due to data limitations.

    Researchers from Queen Mary University and Spotify introduce ViSMaP, an unsupervised method for summarising hour-long videos without requiring costly annotations. Traditional models perform well on short, pre-segmented videos but struggle with longer content where important events are scattered. ViSMaP bridges this gap by using LLMs and a meta-prompting strategy to iteratively generate and refine pseudo-summaries from clip descriptions created by short-form video models. The process involves three LLMs working in sequence for generation, evaluation, and prompt optimisation. ViSMaP achieves performance comparable to fully supervised models across multiple datasets while maintaining domain adaptability and eliminating the need for extensive manual labelling.

    The study addresses cross-domain video summarization by training on a labelled short-form video dataset and adapting to unlabelled, hour-long videos from a different domain. Initially, a model is trained to summarize 3-minute videos using TimeSFormer features, a visual-language alignment module, and a text decoder, optimized by cross-entropy and contrastive losses. To handle longer videos, they are segmented into 3-minute clips, and pseudo-captions are generated. An iterative meta-prompting approach with multiple LLMs (generator, evaluator, optimizer) refines summaries. Finally, the model is fine-tuned on these pseudo-summaries using a symmetric cross-entropy loss to manage noisy labels and improve adaptation.

    The study evaluates VisMaP across three scenarios: summarization of long videos using Ego4D-HCap, cross-domain generalization on MSRVTT, MSVD, and YouCook2 datasets, and adaptation to short videos using EgoSchema. VisMaP, trained on hour-long videos, is compared against supervised and zero-shot methods, such as Video ReCap and LaViLa+GPT3.5, demonstrating competitive or superior performance without supervision. Evaluations use CIDEr, ROUGE-L, METEOR scores, and QA accuracy. Ablation studies highlight the benefits of meta-prompting and component modules, such as contrastive learning and SCE loss. Implementation details include the use of TimeSformer, DistilBERT, and GPT-2, with training conducted on an NVIDIA A100 GPU.

    In conclusion, ViSMaP is an unsupervised approach for summarizing long videos by utilizing annotated short-video datasets and a meta-prompting strategy. It first creates high-quality summaries through meta-prompting and then trains a summarization model, reducing the need for extensive annotations. Experimental results demonstrate that ViSMaP performs on par with fully supervised methods and adapts effectively across various video datasets. However, its reliance on pseudo labels from a source-domain model may impact performance under significant domain shifts. Additionally, ViSMaP currently relies solely on visual information. Future work could integrate multimodal data, introduce hierarchical summarization, and develop more generalizable meta-prompting techniques.


    Check out the Paper. Also, don’t forget to follow us on Twitter and join our Telegram Channel and LinkedIn Group. Don’t Forget to join our 90k+ ML SubReddit.

    🔥 [Register Now] miniCON Virtual Conference on AGENTIC AI: FREE REGISTRATION + Certificate of Attendance + 4 Hour Short Event (May 21, 9 am- 1 pm PST) + Hands on Workshop

    The post ViSMaP: Unsupervised Summarization of Hour-Long Videos Using Meta-Prompting and Short-Form Datasets appeared first on MarkTechPost.

    Source: Read More 

    Facebook Twitter Reddit Email Copy Link
    Previous ArticleAlibaba Qwen Team Just Released Qwen3: The Latest Generation of Large Language Models in Qwen Series, Offering a Comprehensive Suite of Dense and Mixture-of-Experts (MoE) Models
    Next Article Customize Amazon Nova models to improve tool usage

    Related Posts

    Machine Learning

    How to Evaluate Jailbreak Methods: A Case Study with the StrongREJECT Benchmark

    July 28, 2025
    Machine Learning

    Zhipu AI Just Released GLM-4.5 Series: Redefining Open-Source Agentic AI with Hybrid Reasoning

    July 28, 2025
    Leave A Reply Cancel Reply

    For security, use of Google's reCAPTCHA service is required which is subject to the Google Privacy Policy and Terms of Use.

    Continue Reading

    The death of spreadsheets: 6 reasons why AI will soon be the dominant business reporting tool

    News & Updates

    CVE-2025-6447 – “Simple Online Hotel Reservation System SQL Injection Vulnerability”

    Common Vulnerabilities and Exposures (CVEs)

    CVE-2025-5689 – OpenSSH Root Group Privilege Escalation

    Common Vulnerabilities and Exposures (CVEs)

    Retrieving Command Parameters in Laravel Artisan

    Development

    Highlights

    CVE-2025-3708 – Le-Yan Le-Show Medical SQL Injection Vulnerability

    May 2, 2025

    CVE ID : CVE-2025-3708

    Published : May 2, 2025, 4:15 a.m. | 4 hours, 59 minutes ago

    Description : Le-show medical practice management system from Le-yan has a SQL Injection vulnerability, allowing unauthenticated remote attackers to inject arbitrary SQL commands to read, modify, and delete database contents.

    Severity: 9.8 | CRITICAL

    Visit the link for more details, such as CVSS details, affected products, timeline, and more…

    CVE-2025-4429 – “Gearside Developer Dashboard WordPress Plugin Reflected Cross-Site Scripting”

    May 30, 2025

    CVE-2025-6248 – Lenovo Browser Cross-Site Scripting Vulnerability

    July 17, 2025

    15 Angel Investors in Cybersecurity you should know in 2025

    May 13, 2025
    © DevStackTips 2025. All rights reserved.
    • Contact
    • Privacy Policy

    Type above and press Enter to search. Press Esc to cancel.