
    VLM2Vec-V2: A Unified Computer Vision Framework for Multimodal Embedding Learning Across Images, Videos, and Visual Documents

    July 27, 2025

    Embedding models act as bridges between data modalities by encoding diverse multimodal information into a shared dense representation space. Embedding models have advanced considerably in recent years, driven by progress in large foundation models. However, existing multimodal embedding models are trained on datasets such as MMEB and M-BEIR, which focus mostly on natural images and photographs sourced from MSCOCO, Flickr, and ImageNet. These datasets do not cover broader forms of visual information, including documents, PDFs, websites, videos, and slides, so existing embedding models underperform on realistic tasks such as article search, website search, and YouTube video search.
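    To make the idea of a shared representation space concrete, here is a minimal retrieval sketch with a generic multimodal embedder: queries and candidates of any modality are mapped to unit-norm vectors and scored by cosine similarity. The `encode` function is a hypothetical placeholder standing in for any such model, not VLM2Vec-V2's actual API.

```python
import torch
import torch.nn.functional as F

def encode(inputs):
    # Hypothetical placeholder for a multimodal embedding model that
    # maps text, images, video clips, or document pages into one
    # shared d-dimensional space. A real model returns learned vectors;
    # random ones are used here only so the sketch runs end to end.
    return torch.randn(len(inputs), 768)

# Embed one text query and a mixed-modality candidate pool.
query = F.normalize(encode(["how does temporal grounding work"]), dim=-1)
candidates = F.normalize(
    encode(["slide_deck.pdf", "talk_recording.mp4", "summary blog post"]), dim=-1
)

# On unit-norm vectors, cosine similarity is a dot product; the best
# candidate is picked regardless of its original modality.
scores = query @ candidates.T      # shape: (1, 3)
print(scores, scores.argmax(dim=-1))
```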

    Early multimodal embedding benchmarks such as MSCOCO, Flickr30K, and Conceptual Captions focused on static image-text pairs for tasks like image captioning and retrieval. More recent benchmarks such as M-BEIR and MMEB introduced multi-task evaluation but remain limited to static images and short contexts. Video representation learning has evolved through models like VideoCLIP and VideoCoCa, which integrate contrastive learning with captioning objectives. Visual document representation learning advanced through models like ColPali and VisRAG, which use VLMs for document retrieval. Unified-modality retrieval methods like GME and Uni-Retrieval achieve strong performance on universal benchmarks, yet none of them unifies image, video, and visual document retrieval within a single framework.

    Researchers from Salesforce Research, UC Santa Barbara, the University of Waterloo, and Tsinghua University have proposed VLM2Vec-V2 to unify image, video, and visual document retrieval within a single framework. First, they developed MMEB-V2, a benchmark that extends MMEB with five new task types: visual document retrieval, video retrieval, temporal grounding, video classification, and video question answering. Second, they trained VLM2Vec-V2, a general-purpose embedding model that supports multiple input modalities and demonstrates strong performance on both the newly introduced tasks and the original image benchmarks. This establishes a foundation for more scalable and flexible representation learning in both research and practical applications.

    VLM2Vec-V2 utilizes Qwen2-VL as its backbone, selected for its specialized capabilities in multimodal processing. Qwen2-VL offers three critical features that support unified embedding learning: Naive Dynamic Resolution, Multimodal Rotary Position Embedding (M-RoPE), and a unified framework that combines 2D and 3D convolutions. To enable effective multi-task training across diverse data sources, VLM2Vec-V2 introduces a flexible data sampling pipeline with two key components: (a) on-the-fly batch mixing based on predefined sampling weight tables that control the relative probabilities of each dataset, and (b) an interleaved sub-batching strategy that splits full batches into independently sampled sub-batches, improving the stability of contrastive learning.
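    The sketch below illustrates one plausible reading of that sampling pipeline in plain Python: a predefined weight table drives which dataset each sub-batch is drawn from, and the full batch is assembled from several independently sampled sub-batches. The function name and pool layout are hypothetical; the released implementation may differ.

```python
import random

def make_batch(datasets, weights, num_sub_batches, sub_batch_size, rng=None):
    """Assemble one training batch via on-the-fly mixing.

    datasets: dict mapping dataset name -> list of examples
    weights:  dict mapping dataset name -> unnormalized sampling weight
    Returns (source_name, sub_batch) pairs: each sub-batch is drawn from
    a single dataset, mirroring the interleaved sub-batching idea, so
    in-batch negatives within a sub-batch share one data distribution.
    """
    rng = rng or random.Random(0)
    names = list(datasets)
    probs = [weights[n] for n in names]
    batch = []
    for _ in range(num_sub_batches):
        # (a) pick a source dataset according to the sampling weight table
        source = rng.choices(names, weights=probs, k=1)[0]
        # (b) draw the whole sub-batch from that single dataset
        batch.append((source, rng.sample(datasets[source], k=sub_batch_size)))
    return batch

# Toy usage: three modality pools mixed with a predefined weight table.
pools = {
    "image":  [f"img_{i}" for i in range(100)],
    "video":  [f"vid_{i}" for i in range(100)],
    "visdoc": [f"doc_{i}" for i in range(100)],
}
table = {"image": 0.5, "video": 0.2, "visdoc": 0.3}
for source, sub in make_batch(pools, table, num_sub_batches=4, sub_batch_size=8):
    print(source, len(sub))
```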

    VLM2Vec-V2 achieves the highest overall average score of 58.0 across 78 datasets covering image, video, and visual document tasks, outperforming strong baselines including GME, LamRA, and VLM2Vec built on the same Qwen2-VL backbone. On image tasks, VLM2Vec-V2 outperforms most baselines by significant margins and achieves performance comparable to VLM2Vec-7B despite being only 2B parameters in size. For video tasks, the model achieves competitive performance despite training on relatively small amounts of video data. In visual document retrieval, VLM2Vec-V2 outperforms all VLM2Vec variants, but still lags behind ColPali, which is specifically optimized for visual document tasks.

    In conclusion, the researchers introduced VLM2Vec-V2, a strong baseline model trained through contrastive learning across diverse tasks and modality combinations. VLM2Vec-V2 builds on MMEB-V2, a benchmark the researchers designed to assess multimodal embedding models across text, images, videos, and visual documents, and uses Qwen2-VL as its backbone. The experimental evaluation demonstrates that VLM2Vec-V2 achieves balanced performance across modalities while highlighting the diagnostic value of MMEB-V2 for future research.
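    The contrastive objective mentioned here is commonly an InfoNCE-style loss over in-batch negatives; the sketch below shows that standard formulation as an assumption about the training recipe, not a detail quoted from the paper.

```python
import torch
import torch.nn.functional as F

def info_nce(query_emb, target_emb, temperature=0.05):
    """Standard InfoNCE over in-batch negatives.

    query_emb, target_emb: (B, d) embeddings where row i of each tensor
    forms a positive pair; every other row serves as a negative.
    """
    q = F.normalize(query_emb, dim=-1)
    t = F.normalize(target_emb, dim=-1)
    logits = q @ t.T / temperature        # (B, B) similarity matrix
    labels = torch.arange(q.size(0))      # positives sit on the diagonal
    return F.cross_entropy(logits, labels)

# Toy check with random embeddings.
loss = info_nce(torch.randn(32, 768), torch.randn(32, 768))
print(loss.item())
```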


    Check out the Paper, GitHub Page, and Model on Hugging Face. All credit for this research goes to the researchers of this project.

    The post VLM2Vec-V2: A Unified Computer Vision Framework for Multimodal Embedding Learning Across Images, Videos, and Visual Documents appeared first on MarkTechPost.

