Advancing Egocentric Video Question Answering with Multimodal Large Language Models

June 26, 2025

Egocentric Video Question Answering (QA) requires models to handle long-horizon temporal reasoning, first-person perspectives, and specialized challenges like frequent camera movement. This paper systematically evaluates both proprietary and open-source Multimodal Large Language Models (MLLMs) on QaEgo4Dv2, a refined dataset of egocentric videos derived from QaEgo4D. Four popular MLLMs (GPT-4o, Gemini-1.5-Pro, Video-LLaVA-7B, and Qwen2-VL-7B-Instruct) are assessed using zero-shot and fine-tuned approaches for both OpenQA and CloseQA settings. We introduce QaEgo4Dv2 to mitigate annotation noise…


