
    This AI Paper Introduces VLM-R³: A Multimodal Framework for Region Recognition, Reasoning, and Refinement in Visual-Linguistic Tasks

    June 12, 2025

    Multimodal reasoning ability helps machines perform tasks such as solving math problems embedded in diagrams, reading signs from photographs, or interpreting scientific charts. The integration of both visual and linguistic information enables these systems to more closely mirror human thought processes, making them suitable for tasks that require visual interpretation combined with logical progression.

    A major challenge in this area is the inability of current systems to revisit specific parts of an image while reasoning dynamically. Traditional models usually begin by analyzing an image once and then proceed with the rest of the reasoning in pure text. This approach limits accuracy in situations that require revisiting the image to confirm a detail or extract new visual cues during mid-reasoning. These shortcomings are particularly pronounced in tasks that require fine-grained spatial awareness, such as identifying small labels in scientific documents or resolving ambiguities in visually complex scenes.

    Some tools and models have been introduced to address this gap, but they often treat visual grounding as a one-time operation. For example, existing systems like LLaVA-CoT or Qwen2.5-VL offer some visual-text integration. Still, they don’t let the model repeatedly and selectively query parts of an image based on the evolving reasoning process. The grounding, if performed, is generally static and lacks the flexibility to adapt based on intermediate reasoning steps. Moreover, these methods do not train models to determine the importance of specific image regions, leading to limitations in complex problem-solving.

    Researchers from Peking University, Alibaba Group, and ZEEKR Intelligent Technology have introduced a model called VLM-R³. This model tackles the challenge by allowing a more interactive connection between vision and reasoning. It equips the model with the capacity to determine when visual clarification is needed, identify the exact image region for analysis, and re-integrate this visual content into the reasoning process. This approach mimics human problem-solving, where one might zoom into a chart or revisit a paragraph to verify a detail before making a decision. The model’s structure emphasizes refining its decisions iteratively by relying on visual evidence throughout the reasoning process.
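The look-again loop described above can be sketched in a few lines. This is a minimal, hypothetical illustration, not the paper's implementation: the model is abstracted as a `model_step` callable that either answers or names a region to inspect, and the image is a plain 2-D grid. All names (`Region`, `crop`, `interleaved_reasoning`, `toy_model`) are assumptions for illustration only.

```python
from dataclasses import dataclass

@dataclass
class Region:
    """Bounding box in pixel coordinates: (x0, y0) top-left, (x1, y1) bottom-right."""
    x0: int
    y0: int
    x1: int
    y1: int

def crop(image, region):
    """Return the sub-grid of `image` (a 2-D list of pixels) covered by `region`."""
    return [row[region.x0:region.x1] for row in image[region.y0:region.y1]]

def interleaved_reasoning(image, model_step, max_steps=8):
    """Alternate reasoning with visual inspection: at each step the model
    either emits a final answer or requests a region; the crop of that
    region is appended to the context before the next step."""
    context = []
    for _ in range(max_steps):
        action = model_step(context, image)
        if action["type"] == "answer":
            return action["text"], context
        context.append(("crop", crop(image, action["region"])))
    return None, context                  # gave up within the step budget

# Toy stand-in for the model: zoom into the top-right corner once,
# then read the answer off the returned crop.
image = [[0, 0, 7],
         [0, 0, 7],
         [0, 0, 0]]

def toy_model(context, image):
    if not context:                       # first step: ask to look closer
        return {"type": "look", "region": Region(2, 0, 3, 2)}
    _, patch = context[-1]                # second step: answer from the crop
    return {"type": "answer", "text": f"value={patch[0][0]}"}

answer, ctx = interleaved_reasoning(image, toy_model)
```

The key design point the sketch captures is that grounding is not a one-shot preprocessing pass: each crop re-enters the context, so later reasoning steps can condition on it and request further regions.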

    To accomplish this, the researchers built a dataset named Visuo-Lingual Interleaved Rationale (VLIR), designed to train models in a stepwise interaction between images and text. VLM-R³ incorporates this dataset and operates using a method called Region-Conditioned Reinforcement Policy Optimization (R-GRPO). This training strategy encourages the model to selectively focus on informative parts of an image, perform transformations such as cropping or zooming, and incorporate those changes into subsequent logical steps. It simulates how humans shift their attention across different visual elements in response to their thoughts. The architecture integrates a pipeline that loops reasoning with visual inspection in real time, enhancing the system’s ability to interact with visual data during inference.
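R-GRPO builds on Group Relative Policy Optimization, whose core idea is to score each sampled rollout against its own group rather than against a learned value function. The following is a hedged sketch of only that group-relative part; the region conditioning and the full policy update in the paper are out of scope, and the function name is an assumption.

```python
def group_relative_advantages(rewards):
    """GRPO-style advantage: normalize each rollout's reward against the
    mean and standard deviation of its own sampled group, so no separate
    value network is needed."""
    n = len(rewards)
    mean = sum(rewards) / n
    std = (sum((r - mean) ** 2 for r in rewards) / n) ** 0.5
    if std == 0.0:                 # identical rewards carry no learning signal
        return [0.0] * n
    return [(r - mean) / std for r in rewards]

# For one question, several reasoning-with-region rollouts are sampled and
# scored (e.g. by answer correctness); rollouts above the group mean get a
# positive advantage and are reinforced.
advs = group_relative_advantages([0.0, 1.0, 1.0, 0.0])
```

In the R-GRPO setting, the rollouts being compared would each interleave region selections (crops, zooms) with text, so the advantage signal rewards not just correct answers but the region choices that led to them.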

The results demonstrate strong performance across multiple benchmarks. On MathVista, the model reached 70.4%, up from the baseline's 68.2%. On MathVision, it improved from 25.1% to 30.2%. On ScienceQA, it gained 14.3 percentage points, reaching 87.9% against the baseline's 73.6%. On the hallucination test HallusionBench, it achieved 62.0%, outperforming models such as Mulberry, which scored 54.1%. VLM-R³ also showed superior document understanding, scoring 96.8% on DocVQA. Despite using fewer parameters than closed-source models like Gemini-2 Flash or GPT-4o, it delivers competitive accuracy, particularly on tasks requiring detailed visual analysis and interleaved reasoning.

    This work clearly outlines a problem that exists in how models handle vision during reasoning and presents a well-structured solution. By integrating a method for ongoing image analysis, researchers from the Alibaba Group, Peking University, and ZEEKR have advanced a powerful idea—models that look again, think, and refine. The proposed framework significantly improves accuracy in complex tasks and provides a blueprint for more robust, visually aware AI systems.


Check out the Paper and GitHub Page. All credit for this research goes to the researchers of this project.

    The post This AI Paper Introduces VLM-R³: A Multimodal Framework for Region Recognition, Reasoning, and Refinement in Visual-Linguistic Tasks appeared first on MarkTechPost.
