
    ReVisual-R1: An Open-Source 7B Multimodal Large Language Model (MLLM) that Achieves Long, Accurate and Thoughtful Reasoning

    June 19, 2025

    The Challenge of Multimodal Reasoning

    Recent breakthroughs in text-based language models, such as DeepSeek-R1, have demonstrated that reinforcement learning (RL) can help develop strong reasoning skills. Motivated by this, researchers have tried to apply the same RL techniques to multimodal large language models (MLLMs) to enhance their ability to reason over both visual and textual inputs. These attempts have not been entirely successful, however: MLLMs still struggle with complex reasoning tasks. This suggests that simply reusing RL recipes from text-only models does not transfer well to multimodal settings, where the interaction between data types introduces new challenges that require more tailored approaches.

    Evolution of Multimodal Language Models

    Recent research on MLLMs builds on the progress of LLMs by combining visual inputs with language understanding. Early models, such as CLIP and MiniGPT-4, laid the groundwork, followed by instruction-tuned models such as LLaVA. While closed-source models demonstrate strong reasoning through lengthy chain-of-thought (CoT) outputs, open-source models have primarily focused on fine-tuning and CoT adaptations, which often yield brief answers that limit in-depth rationale. RL, including techniques such as RLHF (reinforcement learning from human feedback) and GRPO (Group Relative Policy Optimization), has shown promise for enhancing reasoning in LLMs, and recent work now aims to apply RL to MLLMs to improve visual reasoning and support richer, longer outputs.

    Introduction of ReVisual-R1

    Researchers from Tsinghua University, Shanghai Jiao Tong University, and the Shanghai Artificial Intelligence Laboratory have introduced ReVisual-R1, a 7B-parameter open-source MLLM that sets a new standard in multimodal reasoning. Their study reveals three key insights: (1) Careful text-only pretraining provides a strong cold-start, outperforming many existing MLLMs even before RL; (2) The commonly used GRPO algorithm suffers from gradient stagnation, which they address with a novel method called Prioritized Advantage Distillation (PAD); and (3) Adding a final text-only RL phase after multimodal RL further enhances reasoning. Their three-stage approach, which includes text pretraining, multimodal RL, and final text RL, strikes an effective balance between visual grounding and deep cognitive reasoning. 
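
    To make the gradient-stagnation point concrete, here is a minimal Python sketch of GRPO's group-relative advantage and a prioritization step in the spirit of PAD. The function names and the top-k rule are illustrative assumptions, not the authors' implementation; the exact PAD objective is defined in the paper.

        import torch

        def grpo_advantages(rewards: torch.Tensor) -> torch.Tensor:
            # GRPO-style group-relative advantage: each group of sampled
            # responses is normalized by its own mean and std. If every
            # response in a group earns the same reward (all right or all
            # wrong), the std is ~0 and all advantages collapse to zero,
            # so the policy gradient vanishes -- the stagnation the
            # authors report.
            mean = rewards.mean(dim=-1, keepdim=True)
            std = rewards.std(dim=-1, keepdim=True)
            return (rewards - mean) / (std + 1e-8)

        def pad_select(advantages: torch.Tensor, k: int) -> torch.Tensor:
            # Hypothetical PAD-style prioritization: keep the k rollouts
            # with the largest |advantage| so updates are dominated by
            # informative samples rather than zero-advantage groups.
            flat = advantages.flatten()
            _, idx = torch.topk(flat.abs(), k=min(k, flat.numel()))
            return idx

        # Example: 2 prompts x 4 rollouts; the first group is uniform,
        # so all of its advantages are ~0 (a "stagnant" group).
        rewards = torch.tensor([[1.0, 1.0, 1.0, 1.0],
                                [1.0, 0.0, 0.0, 1.0]])
        adv = grpo_advantages(rewards)
        keep = pad_select(adv, k=4)   # indices of informative rollouts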

    Developing the GRAMMAR Dataset

    The GRAMMAR dataset was developed after the researchers observed that existing multimodal cold-start datasets lack the depth needed to train strong reasoning models. Text-only datasets such as DeepMath produced better gains on both text and multimodal tasks, suggesting that textual complexity better stimulates reasoning. To address this, GRAMMAR combines diverse textual and multimodal samples through a multi-stage curation process. This data fuels the Staged Reinforcement Optimization (SRO) framework, which first trains models with multimodal RL, enhanced by Prioritized Advantage Distillation to avoid stalled learning and an efficient-length reward to curb verbosity, and then applies a text-only RL phase to boost reasoning and language fluency.
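
    The article does not give the exact form of the efficient-length reward, but a simple length-budget shaping illustrates the idea; the target and alpha values below are made-up hyperparameters, not the paper's.

        def length_shaped_reward(correct: bool, n_tokens: int,
                                 target: int = 1024,
                                 alpha: float = 0.1) -> float:
            # Illustrative "efficient-length" reward: a correct answer
            # earns 1.0, and tokens beyond a target budget incur a mild
            # linear penalty, curbing verbosity without punishing
            # genuinely long reasoning too harshly.
            base = 1.0 if correct else 0.0
            overshoot = max(0, n_tokens - target) / target
            return base - alpha * overshoot

        # A correct 2048-token answer scores 1.0 - 0.1 * 1.0 = 0.9,
        # while a correct answer within the budget keeps the full 1.0.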

    Three-Stage Training Pipeline

    The experiments for ReVisual-R1 followed a structured three-stage training process: first pure text data to build a language foundation, then multimodal reinforcement learning for visual-text reasoning, and finally text-only RL fine-tuning to refine reasoning and fluency. The model was evaluated across various benchmarks and outperformed both open-source and some commercial models on multimodal and math reasoning tasks, achieving top results on 9 out of 10 benchmarks. Ablation studies confirmed the importance of the training order and of Prioritized Advantage Distillation, which focused learning on high-quality responses and yielded a significant improvement in overall performance.
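
    As a sketch, the staged curriculum might be wired up as follows; every stage and dataset name here is an illustrative assumption, and the released code on the project's GitHub page is the authoritative reference.

        from dataclasses import dataclass
        from typing import Callable

        @dataclass
        class Stage:
            name: str      # illustrative label
            data: str      # dataset split to draw from
            method: str    # training objective for this stage

        # Text cold start -> multimodal RL (GRPO + PAD + length reward)
        # -> text-only RL refinement, mirroring the pipeline above.
        CURRICULUM = [
            Stage("cold_start_sft", "grammar_text_only", "sft"),
            Stage("multimodal_rl", "grammar_multimodal", "grpo_pad_length"),
            Stage("text_rl_refine", "grammar_text_only", "grpo"),
        ]

        def run_curriculum(model, train_stage: Callable):
            # Each stage resumes from the previous stage's checkpoint;
            # the ablations suggest this ordering is what balances
            # visual grounding against deep reasoning.
            for stage in CURRICULUM:
                model = train_stage(model, stage)
            return model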

    Summary and Contributions

    In conclusion, ReVisual-R1 is a 7B open-source MLLM built to tackle the challenges of complex multimodal reasoning. Instead of relying on scale alone, it uses a well-designed three-stage training process: high-quality text data to lay a foundation for reasoning, a multimodal RL phase stabilized by the new PAD technique, and a final text-based RL refinement. This curriculum significantly boosts performance, and ReVisual-R1 sets a new benchmark among 7B models, excelling on tasks such as MathVerse and AIME. The work highlights how structured training can unlock deeper reasoning in MLLMs.


    Check out the Paper and GitHub Page. All credit for this research goes to the researchers of this project.

