    NVIDIA AI Presents ThinkAct: Vision-Language-Action Reasoning via Reinforced Visual Latent Planning

    July 31, 2025

    Estimated reading time: 5 minutes

    Table of contents

    • Introduction
    • The ThinkAct Framework
    • Experimental Results
    • Ablation Studies and Model Analysis
    • Implementation Details
    • Conclusion

    Introduction

    Embodied AI agents are increasingly expected to interpret complex, multimodal instructions and act robustly in dynamic environments. ThinkAct, presented by researchers from NVIDIA and National Taiwan University, addresses this challenge for vision-language-action (VLA) reasoning by introducing reinforced visual latent planning to bridge high-level multimodal reasoning and low-level robot control.

    Typical VLA models map raw visual and language inputs directly to actions through end-to-end training, which limits reasoning, long-term planning, and adaptability. More recent methods incorporate intermediate chain-of-thought (CoT) reasoning or attempt RL-based optimization, but they struggle with scalability, grounding, or generalization when confronted with highly variable, long-horizon robotic manipulation tasks.

    The ThinkAct Framework

    Dual-System Architecture

    ThinkAct consists of two tightly integrated components:

    • Reasoning Multimodal LLM (MLLM): Performs structured, step-by-step reasoning over visual scenes and language instructions, outputting a visual plan latent that encodes high-level intent and planning context.
    • Action Model: A Transformer-based policy conditioned on the visual plan latent, executing the decoded trajectory as robot actions in the environment.

    This design allows asynchronous operation: the LLM “thinks” and generates plans at a slow cadence, while the action module carries out fine-grained control at a higher frequency, as sketched below.
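
    To make the split concrete, here is a minimal PyTorch-style sketch of the dual-system idea. The module names, layer sizes, latent dimension, and action-chunk interface are illustrative assumptions, not the paper's actual implementation.

        # Illustrative sketch of the ThinkAct dual-system split (assumed shapes/names).
        import torch
        import torch.nn as nn

        class ReasoningMLLM(nn.Module):
            """Slow module: reasons over image + instruction tokens and emits a visual plan latent."""
            def __init__(self, d_model=1024, latent_dim=256):
                super().__init__()
                self.backbone = nn.TransformerEncoder(
                    nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True), num_layers=4)
                self.to_latent = nn.Linear(d_model, latent_dim)

            def forward(self, multimodal_tokens):          # (B, T, d_model)
                h = self.backbone(multimodal_tokens)
                return self.to_latent(h.mean(dim=1))       # (B, latent_dim) visual plan latent

        class ActionPolicy(nn.Module):
            """Fast module: Transformer policy conditioned on the plan latent, outputs an action chunk."""
            def __init__(self, latent_dim=256, obs_dim=512, act_dim=7, horizon=16):
                super().__init__()
                self.horizon = horizon
                self.fuse = nn.Linear(latent_dim + obs_dim, 512)
                self.decoder = nn.TransformerEncoder(
                    nn.TransformerEncoderLayer(512, nhead=8, batch_first=True), num_layers=2)
                self.head = nn.Linear(512, act_dim)

            def forward(self, plan_latent, obs_feat):      # (B, latent_dim), (B, obs_dim)
                x = self.fuse(torch.cat([plan_latent, obs_feat], dim=-1))
                x = x.unsqueeze(1).repeat(1, self.horizon, 1)   # one token per future control step
                return self.head(self.decoder(x))          # (B, horizon, act_dim) action chunk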

    Reinforced Visual Latent Planning

    A core innovation is the reinforcement learning (RL) approach leveraging action-aligned visual rewards:

    • Goal Reward: Encourages the model to align the start and end positions predicted in the plan with those in demonstration trajectories, supporting goal completion.
    • Trajectory Reward: Regularizes the predicted visual trajectory to closely match distributional properties of expert demonstrations using dynamic time warping (DTW) distance.

    The total reward r blends these visual rewards with a format-correctness score, pushing the LLM not only to produce accurate answers but also to generate plans that translate into physically plausible robot actions (a rough sketch of this reward follows).
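
    As an illustration of how these signals could be combined, the sketch below computes a goal term from trajectory endpoints, a DTW-based trajectory term, and a format bonus. The weights, the exponential shaping, and the naive DTW routine are assumptions made for clarity, not the paper's exact formulation.

        # Rough sketch of an action-aligned reward (assumed weights, naive DTW).
        import numpy as np

        def dtw_distance(pred, ref):
            """Naive dynamic-time-warping distance between two trajectories of shape (T, 2)."""
            n, m = len(pred), len(ref)
            D = np.full((n + 1, m + 1), np.inf)
            D[0, 0] = 0.0
            for i in range(1, n + 1):
                for j in range(1, m + 1):
                    cost = np.linalg.norm(pred[i - 1] - ref[j - 1])
                    D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
            return D[n, m]

        def total_reward(pred_traj, demo_traj, format_ok, w_goal=0.5, w_traj=0.5, w_fmt=0.1):
            # Goal reward: start/end points of the predicted plan should match the demonstration.
            goal_err = (np.linalg.norm(pred_traj[0] - demo_traj[0]) +
                        np.linalg.norm(pred_traj[-1] - demo_traj[-1]))
            r_goal = np.exp(-goal_err)
            # Trajectory reward: penalize distributional mismatch via DTW distance.
            r_traj = np.exp(-dtw_distance(pred_traj, demo_traj) / len(demo_traj))
            # Format score: did the LLM produce a well-formed answer/plan?
            r_fmt = 1.0 if format_ok else 0.0
            return w_goal * r_goal + w_traj * r_traj + w_fmt * r_fmt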

    Training Pipeline

    The multi-stage training procedure includes:

    1. Supervised Fine-Tuning (SFT): Cold-start with manually annotated visual trajectory and QA data to teach trajectory prediction, reasoning, and answer formatting.
    2. Reinforced Fine-Tuning: RL optimization (using Group Relative Policy Optimization, GRPO) further incentivizes high-quality reasoning by maximizing the action-aligned rewards defined above (a minimal sketch of the objective follows this list).
    3. Action Adaptation: The downstream action policy is trained using imitation learning, leveraging the frozen LLM’s latent plan output to guide control across varied environments.
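
    The group-relative objective used in stage 2 can be sketched as follows. The group size, clipping threshold, and the absence of a KL penalty are generic simplifications, not the paper's exact GRPO configuration.

        # Sketch of a GRPO-style objective: sample a group of plans per prompt, standardize
        # their rewards within the group, and use the result as a policy-gradient advantage.
        import torch

        def grpo_loss(logprobs_new, logprobs_old, rewards, clip_eps=0.2):
            """
            logprobs_new/old: (G,) summed token log-probs of G sampled responses for one prompt.
            rewards:          (G,) action-aligned rewards for the same responses.
            """
            # Group-relative advantage: standardize rewards within the sampled group.
            adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
            ratio = torch.exp(logprobs_new - logprobs_old)
            unclipped = ratio * adv
            clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * adv
            return -torch.min(unclipped, clipped).mean()

        # Usage sketch with made-up numbers: 4 sampled plans for one instruction.
        rewards = torch.tensor([0.9, 0.4, 0.7, 0.2])
        lp_old = torch.tensor([-12.0, -15.0, -13.5, -16.0])
        lp_new = lp_old + 0.1 * torch.randn(4)
        loss = grpo_loss(lp_new, lp_old, rewards)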

    Inference

    At inference time, given an observed scene and a language instruction, the reasoning module generates a visual plan latent, which then conditions the action module to execute a full trajectory, enabling robust performance even in previously unseen settings.
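
    Operationally, the "slow think, fast act" loop might look like the sketch below. The replanning interval and the reasoner/policy/environment interfaces are placeholders standing in for the actual ThinkAct modules.

        # Sketch of the asynchronous inference loop: replan at a slow cadence, act every step.
        # All objects below are placeholders with assumed interfaces.

        REPLAN_EVERY = 20   # assumed: refresh the plan latent once every 20 control steps

        def run_episode(env, reasoner, policy, instruction, max_steps=200):
            obs = env.reset()
            plan_latent = None
            for step in range(max_steps):
                if step % REPLAN_EVERY == 0:
                    # Slow path: the reasoning MLLM re-reads the scene and instruction and
                    # refreshes the visual plan latent (this is what enables replanning/recovery).
                    plan_latent = reasoner.plan(obs, instruction)
                # Fast path: the policy runs at control frequency, conditioned on the current latent.
                action = policy.act(plan_latent, obs)
                obs, done = env.step(action)   # assumed interface: (next_observation, done_flag)
                if done:
                    return True
            return False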

    Experimental Results

    Robot Manipulation Benchmarks

    Experiments on SimplerEnv and LIBERO benchmarks demonstrate ThinkAct’s superiority:

    • SimplerEnv: Outperforms strong baselines (e.g., OpenVLA, DiT-Policy, TraceVLA) by 11–17% in various settings, especially excelling in long-horizon and visually diverse tasks.
    • LIBERO: Achieves the highest overall success rate (84.4%), excelling in spatial, object, goal, and long-horizon challenges and confirming its ability to generalize and adapt to novel skills and layouts.

    Embodied Reasoning Benchmarks

    On EgoPlan-Bench2, RoboVQA, and OpenEQA, ThinkAct demonstrates:

    • Superior multi-step and long-horizon planning accuracy.
    • State-of-the-art BLEU and LLM-based QA scores, reflecting improved semantic understanding and grounding for visual question answering tasks.

    Few-Shot Adaptation

    ThinkAct enables effective few-shot adaptation: with as few as 10 demonstrations, it achieves substantial success rate gains over other methods, highlighting the power of reasoning-guided planning for quickly learning new skills or environments.

    Self-Reflection and Correction

    Beyond task success, ThinkAct exhibits emergent behaviors:

    • Failure Detection: Recognizes execution errors (e.g., dropped objects).
    • Replanning: Automatically revises plans to recover and complete the task, thanks to reasoning on recent visual input sequences.

    Ablation Studies and Model Analysis

    • Reward Ablations: Both goal and trajectory rewards are essential for structured planning and generalization. Removing either significantly drops performance, and relying only on QA-style rewards limits multi-step reasoning capability.
    • Reduction in Update Frequency: ThinkAct achieves a balance between reasoning (slow, planning) and action (fast, control), allowing robust performance without excessive computational demand.
    • Smaller Models: The approach generalizes to smaller MLLM backbones, maintaining strong reasoning and action capabilities.

    Implementation Details

    • Main backbone: Qwen2.5-VL 7B MLLM.
    • Datasets: Diverse robot and human demonstration videos (Open X-Embodiment, Something-Something V2), plus multimodal QA sets (RoboVQA, EgoPlan-Bench, Video-R1-CoT, etc.).
    • Uses a vision encoder (DINOv2), text encoder (CLIP), and a Q-Former for connecting reasoning output to action policy input (a minimal sketch of this connector follows this list).
    • Extensive experiments on real and simulated settings confirm scalability and robustness.
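
    The Q-Former-style connector mentioned above can be pictured with a minimal sketch: a small set of learnable queries cross-attends to the reasoning module's output tokens and yields a fixed number of conditioning tokens for the action policy. The sizes and layer choices here are assumptions, not the released architecture.

        # Minimal Q-Former-style connector (assumed sizes): learnable queries cross-attend
        # to the reasoning output and produce fixed-size conditioning tokens for the policy.
        import torch
        import torch.nn as nn

        class PlanConnector(nn.Module):
            def __init__(self, d_model=1024, n_queries=32, n_heads=8):
                super().__init__()
                self.queries = nn.Parameter(torch.randn(n_queries, d_model) * 0.02)
                self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
                self.ffn = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                         nn.Linear(4 * d_model, d_model))

            def forward(self, reasoning_tokens):            # (B, T, d_model) from the MLLM
                B = reasoning_tokens.size(0)
                q = self.queries.unsqueeze(0).expand(B, -1, -1)
                attended, _ = self.cross_attn(q, reasoning_tokens, reasoning_tokens)
                return attended + self.ffn(attended)        # (B, n_queries, d_model) conditioning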

    Conclusion

    NVIDIA’s ThinkAct sets a new standard for embodied AI agents, demonstrating that reinforced visual latent planning, in which agents “think before they act”, delivers robust, scalable, and adaptive performance on complex, real-world reasoning and robot manipulation tasks. Its dual-system design, reward shaping, and strong empirical results pave the way for intelligent, generalist robots capable of long-horizon planning, few-shot adaptation, and self-correction in diverse environments.


    Check out the Paper and Project. All credit for this research goes to the researchers of this project.
