
    MMSearch-R1: End-to-End Reinforcement Learning for Active Image Search in LMMs

    April 7, 2025

    Large Multimodal Models (LMMs) have demonstrated remarkable capabilities when trained on extensive visual-text paired data, advancing multimodal understanding tasks significantly. However, these models struggle with complex real-world knowledge, particularly long-tail information that emerges after training cutoffs or domain-specific knowledge restricted by privacy, copyright, or security concerns. When forced to operate beyond their internal knowledge boundaries, LMMs often produce hallucinations, severely compromising their reliability in scenarios where factual accuracy is paramount. While Retrieval-Augmented Generation (RAG) has been widely adopted to overcome these limitations, it introduces its own challenges: the decoupled retrieval and generation components resist end-to-end optimization, and the rigid “retrieve-then-generate” pipeline triggers unnecessary retrievals even when the model already possesses sufficient knowledge, increasing latency and computational cost.

    Recent approaches have made significant strides in addressing knowledge limitations in large models. End-to-end reinforcement learning (RL) methods like OpenAI’s o-series, DeepSeek-R1, and Kimi K-1.5 have remarkably improved model reasoning capabilities. Simultaneously, Deep Research Models developed by major AI labs have shown that training models to interact directly with internet content substantially enhances their performance on complex real-world tasks. Despite these advances, challenges persist in efficiently integrating external knowledge retrieval with generation capabilities. Current methods either prioritize reasoning without optimized knowledge access or focus on retrieval mechanisms that aren’t seamlessly integrated with the model’s generation process. These approaches often fail to achieve the optimal balance between computational efficiency, response accuracy, and the ability to handle dynamic information, leaving significant room for improvement in creating truly adaptive and knowledge-aware multimodal systems.

    The researchers explored an end-to-end RL framework to extend the capability boundaries of LMMs and set out to answer the following questions:

    (1) Can LMMs be trained to perceive their knowledge boundaries and learn to invoke search tools when necessary?

    (2) How effective and efficient is the RL approach?

    (3) Could the RL framework lead to the emergence of robust multimodal intelligent behaviors?

    This research introduces MMSearch-R1, a pioneering approach to equipping LMMs with active image search capabilities through an end-to-end reinforcement learning framework. The method focuses specifically on enhancing visual question answering (VQA) performance by enabling models to autonomously engage with image search tools. MMSearch-R1 trains models to make critical decisions about when to initiate image searches and how to effectively process the retrieved visual information. The system excels at extracting, synthesizing, and utilizing relevant visual data to support sophisticated reasoning processes. As a foundational advancement in multimodal AI, MMSearch-R1 enables LMMs to dynamically interact with external tools in a goal-oriented manner, significantly improving performance on knowledge-intensive and long-tail VQA tasks that traditionally challenge conventional models with their static knowledge bases.
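
    To make this decision loop concrete, the following is a minimal, hypothetical sketch of a single multi-turn rollout in which the model either answers directly or requests one image search before answering. The helpers call_lmm and image_search and the tag string are illustrative placeholders, not the authors' actual interface.

```python
# Hypothetical multi-turn rollout: the model may issue at most one image-search
# call before producing its final answer. call_lmm and image_search are
# assumed stand-ins for the policy model and the search tool, respectively.
SEARCH_TAG = "<search>"

def rollout(question, image, call_lmm, image_search, max_turns=2):
    """Run one VQA query, letting the model decide whether to search."""
    context = [{"role": "user", "content": [image, question]}]
    used_search = False
    reply = ""
    for _ in range(max_turns):
        reply = call_lmm(context)                      # model generates one turn
        if reply.startswith(SEARCH_TAG) and not used_search:
            # The model judged its internal knowledge insufficient: fetch
            # external evidence for the image and feed it back as an observation.
            evidence = image_search(image)
            context.append({"role": "assistant", "content": reply})
            context.append({"role": "tool", "content": evidence})
            used_search = True
        else:
            break                                      # final answer turn
    return reply, used_search
```

    During training, many such rollouts are sampled per question and scored by the reward described below, so the policy learns when a search is worth the cost.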

    MMSearch-R1 employs a comprehensive architecture that combines sophisticated data engineering with advanced reinforcement learning techniques. The system builds upon the robust FactualVQA dataset, specifically constructed to provide unambiguous answers that can be reliably evaluated with automated methods. This dataset was created by extracting 50,000 Visual Concepts from both familiar and unfamiliar sections of the MetaCLIP metadata distribution, retrieving associated images, and using GPT-4o to generate factual question-answer pairs. After rigorous filtering and balancing processes, the dataset ensures an optimal mix of queries that can be answered with and without image search assistance.
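
    The dataset construction described above can be pictured as a short pipeline. This is a hedged sketch under stated assumptions: sample_metaclip_concepts, retrieve_image, and ask_gpt4o_for_qa are hypothetical stand-ins for the concept sampling, image retrieval, and GPT-4o question generation steps, not released code.

```python
# Hypothetical FactualVQA-style construction pipeline. All helper functions are
# assumed placeholders for the steps described in the text.
def build_factual_vqa(n_concepts=50_000):
    pairs = []
    # Sample visual concepts from familiar and unfamiliar (long-tail) regions
    # of the MetaCLIP metadata distribution.
    for concept in sample_metaclip_concepts(n_concepts):
        image = retrieve_image(concept)            # an image tied to the concept
        qa = ask_gpt4o_for_qa(image, concept)      # factual question-answer pair
        if qa is not None:                         # drop malformed generations
            pairs.append({"image": image,
                          "question": qa["question"],
                          "answer": qa["answer"]})
    # Subsequent filtering and balancing keep a mix of queries that can and
    # cannot be answered without image search.
    return pairs
```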

    The reinforcement learning framework adapts the standard GRPO algorithm with multi-turn rollouts, integrating an advanced image search tool based on the veRL framework for end-to-end training. This image search capability combines SerpApi, JINA Reader for content extraction, and LLM-based summarization to retrieve and process relevant web content associated with images. The system employs a carefully calibrated reward function that balances answer correctness, proper formatting, and a mild penalty for tool usage, calculated as 0.9 × (Score – 0.1) + 0.1 × Format when image search is used, and 0.9 × Score + 0.1 × Format when it is not.
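
    For reference, the reward described above can be written as a small function. The variable names are ours, and the sketch assumes Score lies in [0, 1] with a binary format check; it is not the released implementation.

```python
def mmsearch_r1_reward(score: float, format_ok: bool, used_search: bool) -> float:
    """Accuracy-weighted reward with a format term and a mild (0.1) penalty
    subtracted from the score whenever the image search tool was invoked."""
    fmt = 1.0 if format_ok else 0.0
    if used_search:
        return 0.9 * (score - 0.1) + 0.1 * fmt
    return 0.9 * score + 0.1 * fmt
```

    The penalty is small enough that a correct answer obtained through search still earns most of the reward, which discourages gratuitous tool calls without punishing necessary ones.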

    Experimental results demonstrate MMSearch-R1’s significant performance advantages across multiple dimensions. Image search capabilities effectively expand the knowledge boundaries of Large Multimodal Models, with the system learning to make intelligent decisions about when to initiate searches while avoiding over-reliance on external tools. Both supervised fine-tuning (SFT) and reinforcement learning implementations show substantial performance improvements across in-domain FactualVQA testing and out-of-domain benchmarks, including InfoSeek, MMSearch, and Gimmick. Moreover, the models dynamically adjust their search rates based on the familiarity of the visual content, maintaining efficient resource utilization while maximizing accuracy.

    Reinforcement learning demonstrates superior efficiency compared to supervised fine-tuning approaches. When applied directly to Qwen2.5-VL-Instruct-3B/7B models, GRPO achieves better results despite using only half the training data required by SFT methods. This remarkable efficiency highlights RL’s effectiveness in optimizing model performance with limited resources. The system’s ability to balance knowledge access with computational efficiency represents a significant advancement in creating more resource-conscious yet highly capable multimodal systems that can intelligently utilize external knowledge sources.

    MMSearch-R1 successfully demonstrates that outcome-based reinforcement learning can effectively train Large Multimodal Models with active image search capabilities. This approach enables models to autonomously decide when to utilize external visual knowledge sources while maintaining computational efficiency. The promising results establish a strong foundation for developing future tool-augmented, reasoning-capable LMMs that can dynamically interact with the visual world.


    Check out the Blog and Code. All credit for this research goes to the researchers of this project.


    The post MMSearch-R1: End-to-End Reinforcement Learning for Active Image Search in LMMs appeared first on MarkTechPost.
