
    VeBrain: A Unified Multimodal AI Framework for Visual Reasoning and Real-World Robotic Control

    June 10, 2025

    Bridging Perception and Action in Robotics

    Multimodal Large Language Models (MLLMs) hold promise for enabling machines, such as robotic arms and legged robots, to perceive their surroundings, interpret scenarios, and take meaningful actions. The integration of such intelligence into physical systems is advancing the field of robotics, pushing it toward autonomous machines that don’t just see and describe but also plan and move within their environments based on contextual understanding.

    Despite the growing power of MLLMs, one persistent issue is their inability to combine vision, reasoning, and physical interaction into one cohesive system. Typically, models trained to understand images or text fall short when asked to control robots in real-world spaces. The core problem is that understanding a scene is fundamentally different from acting within it. Multimodal understanding focuses on perception and analysis, while physical control needs precise, real-time decision-making based on that perception. This disconnect creates bottlenecks when attempting to build agents that must simultaneously observe, reason, and act in varied environments.

    Limitations of Prior VLA Models

    Previous tools designed for robot control rely heavily on vision-language-action (VLA) models. These models train on extensive robotic datasets to convert visual observations into control signals. While some solutions try to preserve the reasoning capability of MLLMs by translating commands into text-based actions, they face difficulty in maintaining accuracy and adaptability during control tasks. For instance, VLAs often degrade in performance when applied to diverse or long-horizon robotic operations. Furthermore, due to the gap between image-based understanding and motion control, these tools usually fail to generalize across different environments or robot types.
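
    As a rough illustration of the interface such VLA models expose, the following Python sketch maps an image and an instruction directly to a continuous low-level action; the function name, the 7-DoF action layout, and the placeholder network are assumptions made here for illustration, not taken from any specific VLA system.

```python
import numpy as np

def vla_policy(image: np.ndarray, instruction: str) -> np.ndarray:
    """Map an RGB frame and a language instruction directly to a continuous
    low-level action, e.g. a 7-DoF end-effector command
    (dx, dy, dz, droll, dpitch, dyaw, gripper)."""
    ...  # a learned visuomotor network would run here
    return np.zeros(7)

# Because the output lives in a robot-specific action space rather than in text,
# the reasoning ability of the underlying MLLM is hard to preserve, and the policy
# tends not to transfer well across embodiments or long-horizon tasks.
action = vla_policy(np.zeros((224, 224, 3), dtype=np.uint8), "pick up the red block")
print(action.shape)  # (7,)
```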

    Introducing VeBrain: A Unified Multimodal Framework

    Researchers from Shanghai AI Laboratory, Tsinghua University, and SenseTime Research, working with several other institutes, have introduced a unified framework called Visual Embodied Brain (VeBrain). VeBrain reformulates robot control as text-based tasks within a 2D visual space, aligning it more closely with how MLLMs function. The framework integrates multimodal understanding, spatial reasoning, and robotic control into one structure. A specially designed robotic adapter converts the MLLM’s output into executable movement policies, enabling a single model to manage perception, reasoning, and control. VeBrain is also supported by a high-quality instruction dataset called VeBrain-600k, which combines over 600,000 samples of multimodal tasks, including robot motion and reasoning steps.
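
    To make the idea of treating control as text over a 2D visual space concrete, here is a minimal, hypothetical Python sketch of how a textual answer from the MLLM might be parsed into a skill name and 2D keypoints before being handed to the robotic adapter; the answer syntax, the ControlCommand type, and the parse_control_text helper are illustrative assumptions rather than the actual VeBrain interface.

```python
import re
from dataclasses import dataclass

@dataclass
class ControlCommand:
    skill: str                        # e.g. "grasp" or "turn"
    keypoints: list[tuple[int, int]]  # 2D pixel targets in the current camera image

def parse_control_text(text: str) -> ControlCommand:
    """Parse a text answer such as 'skill: grasp; points: (312, 188)' into a command."""
    skill = re.search(r"skill:\s*(\w+)", text).group(1)
    points = [(int(x), int(y)) for x, y in re.findall(r"\((\d+),\s*(\d+)\)", text)]
    return ControlCommand(skill=skill, keypoints=points)

# Because the control answer is plain text over 2D image coordinates, the same MLLM
# can also answer understanding and reasoning queries without changing its output space.
answer = "skill: grasp; points: (312, 188)"
print(parse_control_text(answer))
```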

    Technical Components: Architecture and Robotic Adapter

    To carry out its functions, VeBrain utilizes an architecture based on Qwen2.5-VL, augmented with components that enable real-world control. The robotic adapter contains four key modules. The point tracker updates 2D keypoints as the robot’s view changes, ensuring accurate targeting. The movement controller transforms 2D keypoints into 3D movements by combining image data with depth maps. The skill executor maps predicted actions, such as “turn” or “grasp,” to pre-trained robotic skills. Lastly, the dynamic takeover module monitors for failures or anomalies, handing control back to the MLLM when needed. These modules form a closed-loop system that makes decisions, acts, and self-corrects, allowing robots to operate effectively in diverse situations.
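
    A short, hypothetical Python sketch of how such a four-module closed loop could be organized is given below; the function names, signatures, skill table, and camera model are illustrative assumptions and do not reproduce the released adapter implementation.

```python
import numpy as np

def track_point(prev_frame, cur_frame, point_2d):
    """Point tracker: re-localize a 2D keypoint after the robot's viewpoint changes."""
    ...  # e.g. optical flow or a learned point tracker would run here
    return point_2d

def lift_to_3d(point_2d, depth_map, intrinsics):
    """Movement controller: combine a 2D keypoint with depth to obtain a 3D target
    using a pinhole camera model (fx, fy, cx, cy)."""
    u, v = point_2d
    z = float(depth_map[v, u])
    fx, fy, cx, cy = intrinsics
    return np.array([(u - cx) * z / fx, (v - cy) * z / fy, z])

SKILLS = {  # skill executor: pre-trained low-level behaviors keyed by name
    "grasp": lambda target: print(f"grasp at {target}"),
    "turn": lambda target: print(f"turn toward {target}"),
}

def step(skill, keypoint_2d, prev_frame, cur_frame, depth_map, intrinsics, anomaly_detected):
    """One closed-loop step; the dynamic takeover hands control back to the MLLM on anomalies."""
    if anomaly_detected:
        return "replan"  # dynamic takeover: query the MLLM again with the current observation
    point = track_point(prev_frame, cur_frame, keypoint_2d)
    SKILLS[skill](lift_to_3d(point, depth_map, intrinsics))
    return "ok"

# Toy usage: one step toward a keypoint at pixel (160, 120), assuming a flat 1 m depth map.
depth = np.ones((240, 320))
print(step("grasp", (160, 120), None, None, depth, (300.0, 300.0, 160.0, 120.0), False))
```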

    Performance Evaluation Across Multimodal and Robotic Benchmarks

    VeBrain was evaluated across 13 multimodal and 5 spatial benchmarks. On MMVet, it achieved a 5.6% improvement over Qwen2.5-VL. It scored 101.5 on the CIDEr metric for ScanQA and 83.7 on MMBench. On the VSI benchmark, it averaged 39.9, outperforming Qwen2.5-VL’s 35.9. In robotic evaluations, VeBrain reached an 86.4% success rate across seven legged-robot tasks, significantly surpassing VLA and π0 baselines, which scored 32.1% and 31.4%, respectively. On robotic-arm tasks, it achieved a success rate of 74.3%, outperforming other approaches by up to 80%. These results show VeBrain’s ability to handle long-horizon and spatially complex control challenges with high reliability.

    Conclusion

    The research presents a compelling direction for embodied AI. Researchers succeeded in redefining robot control as a language task, enabling high-level reasoning and low-level action to coexist. The method bridges the gap between image understanding and robot execution in a way that’s both functional and scalable. With a robust design and strong performance, VeBrain signals a shift toward more unified, intelligent robotics systems capable of operating autonomously across diverse tasks and environments.


    Check out the Paper and GitHub Page. All credit for this research goes to the researchers of this project.

    The post VeBrain: A Unified Multimodal AI Framework for Visual Reasoning and Real-World Robotic Control appeared first on MarkTechPost.
