    UniME: A Two-Stage Framework for Enhancing Multimodal Representation Learning with MLLMs

    April 29, 2025

    The CLIP framework has become foundational in multimodal representation learning, particularly for tasks such as image-text retrieval. However, it faces several limitations: a strict 77-token cap on text input, a dual-encoder design that separates image and text processing, and a limited compositional understanding that resembles bag-of-words models. These issues hinder its effectiveness in capturing nuanced, instruction-sensitive semantics. Although MLLMs like LLaVA, Qwen2-VL, and CogVLM offer significant advances in vision-language reasoning, their autoregressive next-token prediction objective restricts their ability to learn generalized, transferable embeddings. This has sparked growing interest in developing alternative methods that can combine the strengths of both contrastive learning and LLM-based reasoning.
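    To make the first limitation concrete, here is a minimal sketch (assuming the Hugging Face transformers library and the openai/clip-vit-base-patch32 checkpoint) showing how CLIP's tokenizer silently truncates anything beyond 77 tokens:

    ```python
    from transformers import CLIPTokenizer

    # Tokenizer used by OpenAI's CLIP (ViT-B/32 checkpoint).
    tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")
    print(tokenizer.model_max_length)  # 77 -- CLIP's hard context limit

    # A long caption is silently cut off at 77 tokens, so everything past
    # that point never reaches the text encoder.
    long_caption = "A detailed photo of " + "a very " * 100 + "crowded street"
    ids = tokenizer(long_caption, truncation=True, max_length=77)["input_ids"]
    print(len(ids))  # 77
    ```

    Any fine-grained detail in the truncated tail is simply invisible to the model, which is why long-caption retrieval is a recurring weak spot for CLIP-style encoders.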

    Recent approaches aim to overcome these limitations by employing novel architectures and training strategies. For instance, E5-V proposes unimodal contrastive training for aligning cross-modal features, while VLM2Vec introduces the MMEB benchmark to convert advanced vision-language models into effective embedding generators. Models like LLM2Vec and NV-Embed enhance text-based representation learning by modifying the attention mechanisms in decoder-only LLMs. Despite these innovations, challenges such as handling long sequences, enabling better cross-modal fusion, and effectively distinguishing hard negatives in contrastive learning remain. As multimodal applications expand, there is a pressing need for representation learning methods that are both scalable and capable of fine-grained semantic alignment.
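    For readers unfamiliar with the contrastive objective these methods share, the sketch below shows generic in-batch InfoNCE, where each query's paired target is the positive and all other targets in the batch serve as negatives. This is the common building block, not the exact loss of E5-V or VLM2Vec:

    ```python
    import torch
    import torch.nn.functional as F

    def info_nce(query_emb, target_emb, temperature=0.05):
        """In-batch InfoNCE: the target at the same batch index is the
        positive; every other target in the batch is a negative."""
        q = F.normalize(query_emb, dim=-1)
        t = F.normalize(target_emb, dim=-1)
        logits = q @ t.T / temperature                 # (B, B) similarities
        labels = torch.arange(q.size(0), device=q.device)
        return F.cross_entropy(logits, labels)

    # Toy usage with 8 random query/target pairs of dimension 256.
    loss = info_nce(torch.randn(8, 256), torch.randn(8, 256))
    ```

    The hard-negative problem mentioned above arises precisely here: random in-batch negatives are usually too easy, so the loss stops teaching the model to separate near-duplicates.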

    Researchers from institutions including The University of Sydney, DeepGlint, Tongyi Lab at Alibaba, and Imperial College London introduce UniME, a two-stage framework designed to improve multimodal representation learning using MLLMs. The first stage applies textual discriminative knowledge distillation from a strong LLM teacher to enhance the language encoder. The second stage employs hard negative enhanced instruction tuning, which involves filtering false negatives and sampling multiple challenging negatives per instance to improve the model’s discriminative and instruction-following abilities. Evaluations on the MMEB benchmark and various retrieval tasks show that UniME delivers consistent and significant improvements in both performance and compositional understanding.
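    The paper's precise distillation objective is best taken from the source, but one plausible reading of the first stage is to align the student's in-batch text-similarity distribution with the frozen teacher's, roughly as follows:

    ```python
    import torch
    import torch.nn.functional as F

    def distill_loss(student_emb, teacher_emb, temperature=0.05):
        """Match the student's in-batch text-similarity distribution to the
        frozen teacher's via KL divergence (an illustrative formulation;
        the paper's exact objective may differ)."""
        s = F.normalize(student_emb, dim=-1)
        t = F.normalize(teacher_emb, dim=-1)
        s_logits = s @ s.T / temperature
        t_logits = (t @ t.T / temperature).detach()    # teacher gives no gradient
        return F.kl_div(
            F.log_softmax(s_logits, dim=-1),
            F.softmax(t_logits, dim=-1),
            reduction="batchmean",
        )
    ```

    The appeal of distilling on text alone is that strong embedding-specialized LLM teachers exist for text, and a better language encoder transfers to the multimodal embedding space in the second stage.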

    The UniME framework introduces a two-stage method for learning universal multimodal embeddings using MLLMs. First, it employs textual discriminative knowledge distillation, where a student MLLM is trained using text-only prompts and supervised by a teacher model to enhance embedding quality. Then, a second stage—hard negative enhanced instruction tuning—improves cross-modal alignment and task performance by filtering false negatives and sampling hard negatives. This stage also leverages task-specific prompts to enhance instruction-following for various applications, such as retrieval and visual question answering. Together, these stages significantly boost UniME’s performance on both in- and out-of-distribution tasks.
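    The sketch below illustrates the general idea behind the second stage: off-diagonal candidates that score nearly as high as the positive are treated as likely false negatives and masked out, and only the k hardest survivors are kept as negatives. The margin, k, and temperature values here are illustrative, not the paper's:

    ```python
    import torch
    import torch.nn.functional as F

    def hard_negative_loss(q, t, k=4, margin=0.1, temperature=0.05):
        """Contrastive loss over filtered hard negatives (illustrative values).
        Candidates almost as similar as the positive are treated as likely
        false negatives and masked; the k hardest survivors are kept."""
        q, t = F.normalize(q, dim=-1), F.normalize(t, dim=-1)
        sim = q @ t.T                                   # (B, B) similarities
        pos = sim.diagonal().unsqueeze(1)               # (B, 1) positives
        eye = torch.eye(sim.size(0), dtype=torch.bool, device=sim.device)
        # False-negative filtering: drop off-diagonal candidates whose score
        # is within `margin` of the positive.
        sim = sim.masked_fill(~eye & (sim > pos - margin), float("-inf"))
        hard = sim.masked_fill(eye, float("-inf")).topk(k, dim=1).values
        logits = torch.cat([pos, hard], dim=1) / temperature
        labels = torch.zeros(sim.size(0), dtype=torch.long, device=sim.device)
        return F.cross_entropy(logits, labels)

    loss = hard_negative_loss(torch.randn(8, 256), torch.randn(8, 256))
    ```

    Training only against hard survivors concentrates the gradient on exactly the near-miss distinctions that bag-of-words-like encoders fail to make.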

    The study evaluated UniME on Phi3.5-V and LLaVA-1.6 using PyTorch with DeepSpeed for efficient training across 8 NVIDIA A100 GPUs. Training consisted of two stages: a textual knowledge distillation phase using the NLI dataset (273,000 pairs) and a hard negative instruction tuning phase on 662,000 multimodal pairs. NV-Embed V2 served as the teacher model. UniME was evaluated on 36 MMEB benchmark datasets, achieving consistent improvements over baselines such as E5-V and VLM2Vec. Hard negatives significantly improved the model’s ability to distinguish subtle differences, thereby enhancing its performance, particularly in long-caption and compositional retrieval tasks. Ablation studies confirmed the effectiveness of both training stages and tuning parameters.
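    As a rough illustration of how embeddings are typically pulled from an MLLM backbone such as Phi3.5-V, the snippet below prompts for a one-word summary and pools the final hidden state of the last token, the recipe popularized by E5-V. The checkpoint usage follows the public Phi-3.5-vision model card, but the prompt wording and image path are placeholders, not UniME's released configuration:

    ```python
    import torch
    from PIL import Image
    from transformers import AutoModelForCausalLM, AutoProcessor

    model_id = "microsoft/Phi-3.5-vision-instruct"
    processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
    model = AutoModelForCausalLM.from_pretrained(
        model_id, trust_remote_code=True, torch_dtype=torch.float16, device_map="cuda"
    )

    image = Image.open("example.jpg")  # placeholder image path
    prompt = ("<|user|>\n<|image_1|>\n"
              "Summarize the above image in one word:<|end|>\n<|assistant|>\n")
    inputs = processor(prompt, [image], return_tensors="pt").to("cuda")

    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    # Last-token pooling: final hidden state of the last position.
    embedding = out.hidden_states[-1][:, -1]  # (1, hidden_dim)
    ```

    At evaluation time, embeddings extracted this way are ranked by cosine similarity against candidate embeddings produced with task-specific prompts.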

    In conclusion, UniME is a two-stage framework for improving multimodal representation learning with MLLMs. The first stage distills textual discriminative knowledge from a large language model to strengthen the MLLM's language embeddings. The second stage enhances learning through instruction tuning with multiple hard negatives per batch, reducing false-negative interference and pushing the model to distinguish challenging examples. Extensive evaluation on MMEB and a range of retrieval tasks shows that UniME consistently boosts performance, delivering strong discriminative and compositional abilities and overcoming limitations of prior models such as CLIP.


    Check out the Paper and Code.
    The post UniME: A Two-Stage Framework for Enhancing Multimodal Representation Learning with MLLMs appeared first on MarkTechPost.
