Close Menu
    DevStackTipsDevStackTips
    • Home
    • News & Updates
      1. Tech & Work
      2. View All

      How To Prevent WordPress SQL Injection Attacks

      June 11, 2025

      Credit Card Sins

      June 11, 2025

      Azul significantly cuts down on false positives in Java vulnerability detection with latest update to Azul Intelligence Cloud

      June 10, 2025

      The state of strategic portfolio management

      June 10, 2025

      Xbox Games Showcase reveals Indiana Jones and the Great Circle to get new DLC this fall

      June 10, 2025

      Tony Hawk’s Pro Skater 3 + 4 demo now is available now if you preorder — Michelangelo is coming too

      June 10, 2025

      “The full Call of Duty package”: Black Ops 7 has been confirmed during the Xbox Games Showcase, and no, it is not an expansion

      June 10, 2025

      Final Fantasy VII Remake and Final Fantasy XVI are FINALLY launching for Xbox consoles — One of which is available right now!

      June 10, 2025
    • Development
      1. Algorithms & Data Structures
      2. Artificial Intelligence
      3. Back-End Development
      4. Databases
      5. Front-End Development
      6. Libraries & Frameworks
      7. Machine Learning
      8. Security
      9. Software Engineering
      10. Tools & IDEs
      11. Web Design
      12. Web Development
      13. Web Security
      14. Programming Languages
        • PHP
        • JavaScript
      Featured

      Neobrutalism CSS Button Generator

      June 11, 2025
      Recent

      Neobrutalism CSS Button Generator

      June 11, 2025

      How to Become a Node.js Backend Developer

      June 11, 2025

      Professor Rita McGrath on Leading in Times of Uncertainty

      June 11, 2025
    • Operating Systems
      1. Windows
      2. Linux
      3. macOS
      Featured

      Microsoft mocks macOS 26 Liquid Design with Windows Aero throwback (Windows Vista)

      June 11, 2025
      Recent

      Microsoft mocks macOS 26 Liquid Design with Windows Aero throwback (Windows Vista)

      June 11, 2025

      Hindsite is a fast, lightweight static website generator

      June 11, 2025

      Orange Pi 5 Ultra and Max Single Board Computers Running Linux: Introduction

      June 11, 2025
    • Learning Resources
      • Books
      • Cheatsheets
      • Tutorials & Guides
    Home»Development»Machine Learning»VeBrain: A Unified Multimodal AI Framework for Visual Reasoning and Real-World Robotic Control

    VeBrain: A Unified Multimodal AI Framework for Visual Reasoning and Real-World Robotic Control

    June 10, 2025

    Bridging Perception and Action in Robotics

    Multimodal Large Language Models (MLLMs) hold promise for enabling machines, such as robotic arms and legged robots, to perceive their surroundings, interpret scenarios, and take meaningful actions. The integration of such intelligence into physical systems is advancing the field of robotics, pushing it toward autonomous machines that don’t just see and describe but also plan and move within their environments based on contextual understanding.

    Despite the growing power of MLLMs, one persistent issue is their inability to combine vision, reasoning, and physical interaction into one cohesive system. Typically, models trained to understand images or text fall short when asked to control robots in real-world spaces. The core problem is that understanding a scene is fundamentally different from acting within it. Multimodal understanding focuses on perception and analysis, while physical control needs precise, real-time decision-making based on that perception. This disconnect creates bottlenecks when attempting to build agents that must simultaneously observe, reason, and act in varied environments.

    Limitations of Prior VLA Models

    Previous tools designed for robot control rely heavily on vision-language-action (VLA) models. These models train on extensive robotic datasets to convert visual observations into control signals. While some solutions try to preserve the reasoning capability of MLLMs by translating commands into text-based actions, they face difficulty in maintaining accuracy and adaptability during control tasks. For instance, VLAs often degrade in performance when applied to diverse or long-horizon robotic operations. Furthermore, due to the gap between image-based understanding and motion control, these tools usually fail to generalize across different environments or robot types.

    Introducing VeBrain: A Unified Multimodal Framework

    Researchers from Shanghai AI Laboratory, Tsinghua University, and SenseTime Research have introduced a unified framework called Visual Embodied Brain (VeBrain) in collaboration with multiple other institutes. VeBrain reformulates robot control as text-based tasks within a 2D visual space, aligning it more closely with how MLLMs function. The framework integrates multimodal understanding, spatial reasoning, and robotic control into one structure. A specially designed robotic adapter processes the MLLM’s output into executable movement policies, enabling a single model to manage perception, reasoning, and control. VeBrain is also supported by a high-quality instruction dataset called VeBrain-600k, which combines over 600,000 samples of multimodal tasks, including robot motion and reasoning steps.

    Technical Components: Architecture and Robotic Adapter

    To carry out its functions, VeBrain utilizes an architecture based on Qwen2.5-VL, augmented with components that enable real-world control. The robotic adapter contains four key modules. The point tracker updates 2D keypoints as the robot’s view changes, ensuring accurate targeting. The movement controller transforms 2D key points into 3D movements by combining image data with depth maps. The skill executor maps predicted actions, such as “turn” or “grasp,” to pre-trained robotic skills. Lastly, the dynamic takeover module monitors failures or anomalies, handing control back to the MLLM when needed. These modules form a closed-loop system that makes decisions, acts, and self-corrects, allowing robots to operate effectively in diverse situations.

    Performance Evaluation Across Multimodal and Robotic Benchmarks

    VeBrain was evaluated across 13 multimodal and 5 spatial benchmarks. On MMVet, it achieved a 5.6% improvement over Qwen2.5-VL. It achieved a score of 101.5 on the CIDEr metric for ScanQA and scored 83.7 on MMBench. On the VSI benchmark, it averaged 39.9, outperforming Qwen2.5-VL’s 35.9. In robotic evaluations, VeBrain showed 86.4% success across seven-legged robot tasks, significantly surpassing models like VLA and π0, which scored 32.1% and 31.4%, respectively. On robotic arm tasks, it achieved a success rate of 74.3%, outperforming others by up to 80%. These results show VeBrain’s ability to handle long-horizon and spatially complex control challenges with high reliability.

    Conclusion

    The research presents a compelling direction for embodied AI. Researchers succeeded in redefining robot control as a language task, enabling high-level reasoning and low-level action to coexist. The method bridges the gap between image understanding and robot execution in a way that’s both functional and scalable. With a robust design and strong performance, VeBrain signals a shift toward more unified, intelligent robotics systems capable of operating autonomously across diverse tasks and environments.


    Check out the Paper and GitHub Page. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter and don’t forget to join our 99k+ ML SubReddit and Subscribe to our Newsletter.

    The post VeBrain: A Unified Multimodal AI Framework for Visual Reasoning and Real-World Robotic Control appeared first on MarkTechPost.

    Source: Read More 

    Facebook Twitter Reddit Email Copy Link
    Previous ArticleFrom Text to Action: How Tool-Augmented AI Agents Are Redefining Language Models with Reasoning, Memory, and Autonomy
    Next Article Apple introduces a delightful and elegant new software design

    Related Posts

    Machine Learning

    How to Evaluate Jailbreak Methods: A Case Study with the StrongREJECT Benchmark

    June 11, 2025
    Machine Learning

    How Much Do Language Models Really Memorize? Meta’s New Framework Defines Model Capacity at the Bit Level

    June 11, 2025
    Leave A Reply Cancel Reply

    For security, use of Google's reCAPTCHA service is required which is subject to the Google Privacy Policy and Terms of Use.

    Continue Reading

    Design System In 90 Days

    Tech & Work

    Building a Fully-Featured 3D World in the Browser with Blender and Three.js

    News & Updates

    Your Google TV is getting a free feature upgrade – smart home users will love it

    News & Updates

    Mastering Node.js Streams: The Ultimate Guide to Memory-Efficient File Processing

    Development

    Highlights

    Why React Native Is the Best Choice for Startups on a Budget💡

    April 14, 2025

    Post Content Source: Read More 

    DistroWatch Weekly, Issue 1122

    May 18, 2025

    Streamline personalization development: How automated ML workflows accelerate Amazon Personalize implementation

    June 2, 2025

    Cisco fixes max severity IOS XE flaw letting attackers hijack devices

    May 8, 2025
    © DevStackTips 2025. All rights reserved.
    • Contact
    • Privacy Policy

    Type above and press Enter to search. Press Esc to cancel.