    NVIDIA AI Presents ThinkAct: Vision-Language-Action Reasoning via Reinforced Visual Latent Planning

    July 31, 2025

    Estimated reading time: 5 minutes

    Table of contents

    • Introduction
    • The ThinkAct Framework
    • Experimental Results
    • Ablation Studies and Model Analysis
    • Implementation Details
    • Conclusion

    Introduction

    Embodied AI agents are increasingly being called upon to interpret complex, multimodal instructions and act robustly in dynamic environments. ThinkAct, presented by researchers from NVIDIA and National Taiwan University, offers a breakthrough in vision-language-action (VLA) reasoning, introducing reinforced visual latent planning to bridge high-level multimodal reasoning and low-level robot control.

    Typical VLA models map raw visual and language inputs directly to actions through end-to-end training, which limits reasoning, long-term planning, and adaptability. Recent methods have begun to incorporate intermediate chain-of-thought (CoT) reasoning or RL-based optimization, but they struggle with scalability, grounding, and generalization when confronted with highly variable, long-horizon robotic manipulation tasks.

    The ThinkAct Framework

    Dual-System Architecture

    ThinkAct consists of two tightly integrated components:

    • Reasoning Multimodal LLM (MLLM): Performs structured, step-by-step reasoning over visual scenes and language instructions, outputting a visual plan latent that encodes high-level intent and planning context.
    • Action Model: A Transformer-based policy conditioned on the visual plan latent, executing the decoded trajectory as robot actions in the environment.

    This design allows asynchronous operation: the LLM “thinks” and generates plans at a slow cadence, while the action module carries out fine-grained control at higher frequency.
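    The snippet below is a minimal sketch of this slow/fast split, assuming a reasoning module that returns a plan latent and an action policy that consumes it at every step; the module names, latent size, replan interval, and environment API are illustrative assumptions, not NVIDIA's actual interfaces.

```python
# Minimal sketch of ThinkAct's slow/fast dual-system loop (all names and sizes are illustrative).
import numpy as np

LATENT_DIM = 512    # assumed size of the visual plan latent
REPLAN_EVERY = 20   # environment steps between "think" updates (assumption)

class ReasoningMLLM:
    """Stand-in for the multimodal LLM that reasons and emits a visual plan latent."""
    def plan(self, image: np.ndarray, instruction: str) -> np.ndarray:
        # A real system would run structured reasoning here; we return a dummy latent.
        return np.random.randn(LATENT_DIM)

class ActionPolicy:
    """Stand-in for the Transformer policy conditioned on the plan latent."""
    def act(self, image: np.ndarray, plan_latent: np.ndarray) -> np.ndarray:
        # A real policy would decode a low-level command (e.g., end-effector deltas).
        return np.zeros(7)

def control_loop(env, instruction: str, max_steps: int = 200):
    """Run fast control at every step while refreshing the plan latent at a slow cadence."""
    reasoner, policy = ReasoningMLLM(), ActionPolicy()
    obs = env.reset()
    plan_latent = reasoner.plan(obs, instruction)          # slow "think" step
    for t in range(max_steps):
        if t > 0 and t % REPLAN_EVERY == 0:
            plan_latent = reasoner.plan(obs, instruction)  # occasional re-planning
        action = policy.act(obs, plan_latent)              # fast "act" step every tick
        obs, done = env.step(action)                       # assumed env API: (obs, done)
        if done:
            break
```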

    Reinforced Visual Latent Planning

    A core innovation is the reinforcement learning (RL) approach leveraging action-aligned visual rewards:

    • Goal Reward: Encourages the model to align the start and end positions predicted in the plan with those in demonstration trajectories, supporting goal completion.
    • Trajectory Reward: Regularizes the predicted visual trajectory to closely match distributional properties of expert demonstrations using dynamic time warping (DTW) distance.

    The total reward r blends these visual rewards with a format-correctness score, pushing the LLM to produce not only accurate answers but also plans that translate into physically plausible robot actions.
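    As a rough illustration of how such a reward could be assembled, the sketch below scores a predicted 2D visual trajectory against an expert demonstration; the DTW implementation, exponential shaping, and blend weights are placeholder assumptions rather than the paper's exact formulation.

```python
# Illustrative action-aligned reward (weights, normalization, and DTW details are placeholders).
import numpy as np

def dtw_distance(pred: np.ndarray, expert: np.ndarray) -> float:
    """Plain dynamic-time-warping distance between two trajectories (T x 2 arrays)."""
    n, m = len(pred), len(expert)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(pred[i - 1] - expert[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return float(D[n, m])

def goal_reward(pred: np.ndarray, expert: np.ndarray) -> float:
    """Reward alignment of the predicted start/end positions with the demonstration."""
    err = np.linalg.norm(pred[0] - expert[0]) + np.linalg.norm(pred[-1] - expert[-1])
    return float(np.exp(-err))

def trajectory_reward(pred: np.ndarray, expert: np.ndarray) -> float:
    """Reward overall shape similarity via DTW distance."""
    return float(np.exp(-dtw_distance(pred, expert) / len(expert)))

def total_reward(pred, expert, format_score, w_visual=0.9):
    """Blend the visual rewards with a format-correctness score (weights are assumptions)."""
    r_visual = 0.5 * goal_reward(pred, expert) + 0.5 * trajectory_reward(pred, expert)
    return w_visual * r_visual + (1.0 - w_visual) * format_score
```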

    Training Pipeline

    The multi-stage training procedure includes:

    1. Supervised Fine-Tuning (SFT): Cold-start training on manually annotated visual trajectory and QA data teaches trajectory prediction, reasoning, and answer formatting.
    2. Reinforced Fine-Tuning: RL optimization with Group Relative Policy Optimization (GRPO) further incentivizes high-quality reasoning by maximizing the newly defined action-aligned rewards (a minimal sketch of the group-relative update follows this list).
    3. Action Adaptation: The downstream action policy is trained using imitation learning, leveraging the frozen LLM’s latent plan output to guide control across varied environments.
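    Step 2's GRPO update hinges on group-relative advantages: several reasoning rollouts are sampled per prompt and each rollout's reward is normalized against the group. The sketch below shows only that core idea; the sampling, likelihood ratios, clipping, and KL terms of the full objective are omitted, and this is not the paper's training code.

```python
# Core of a GRPO-style update: advantages are rewards normalized within a sampled group
# (likelihood ratios, clipping, and the KL penalty of full GRPO are omitted here).
import numpy as np

def group_relative_advantages(rewards, eps: float = 1e-6) -> np.ndarray:
    """Normalize each sampled response's reward against its group's mean and std."""
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + eps)

# Example: for one (observation, instruction) prompt, sample several reasoning rollouts,
# score each with the action-aligned reward, and weight its log-likelihood by its advantage.
rewards = [0.82, 0.41, 0.90, 0.35]             # e.g., total_reward(...) for each sampled plan
advantages = group_relative_advantages(rewards)
print(advantages)                               # positive for better-than-average rollouts
```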

    Inference

    At inference time, given an observed scene and a language instruction, the reasoning module generates a visual plan latent, which then conditions the action module to execute a full trajectory, enabling robust performance even in previously unseen settings.

    Experimental Results

    Robot Manipulation Benchmarks

    Experiments on SimplerEnv and LIBERO benchmarks demonstrate ThinkAct’s superiority:

    • SimplerEnv: Outperforms strong baselines (e.g., OpenVLA, DiT-Policy, TraceVLA) by 11–17% in various settings, especially excelling in long-horizon and visually diverse tasks.
    • LIBERO: Achieves the highest overall success rate (84.4%), excelling in spatial, object, goal, and long-horizon challenges and confirming its ability to generalize and adapt to novel skills and layouts.

    Embodied Reasoning Benchmarks

    On EgoPlan-Bench2, RoboVQA, and OpenEQA, ThinkAct demonstrates:

    • Superior multi-step and long-horizon planning accuracy.
    • State-of-the-art BLEU and LLM-based QA scores, reflecting improved semantic understanding and grounding for visual question answering tasks.

    Few-Shot Adaptation

    ThinkAct enables effective few-shot adaptation: with as few as 10 demonstrations, it achieves substantial success rate gains over other methods, highlighting the power of reasoning-guided planning for quickly learning new skills or environments.
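    One plausible way to realize this adaptation, given that the reasoning MLLM is kept frozen after RL, is to fine-tune only the downstream action policy on the handful of new demonstrations via behavior cloning. The sketch below assumes a PyTorch-style policy and MSE loss; it illustrates the idea and is not the authors' recipe.

```python
# Hypothetical few-shot adaptation: keep the reasoning MLLM frozen and fine-tune only the
# action policy on ~10 demonstrations via behavior cloning (loss, optimizer, shapes assumed).
import torch
import torch.nn as nn

def adapt_action_policy(action_policy: nn.Module, demos, epochs: int = 50, lr: float = 1e-4):
    """demos: list of (plan_latent, obs_features, expert_action) tensors from new-task demos."""
    optimizer = torch.optim.AdamW(action_policy.parameters(), lr=lr)
    loss_fn = nn.MSELoss()
    action_policy.train()
    for _ in range(epochs):
        for plan_latent, obs_features, expert_action in demos:
            pred = action_policy(torch.cat([plan_latent, obs_features], dim=-1))
            loss = loss_fn(pred, expert_action)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return action_policy
```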

    Self-Reflection and Correction

    Beyond task success, ThinkAct exhibits emergent behaviors:

    • Failure Detection: Recognizes execution errors (e.g., dropped objects).
    • Replanning: Automatically revises plans to recover and complete the task, thanks to reasoning over recent visual input sequences (a hypothetical monitoring loop is sketched below).
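    The loop below illustrates how such self-reflection could be wired around the dual-system sketch shown earlier; the detect_failure check is a hypothetical placeholder (a real check might query the MLLM about recent frames), not ThinkAct's actual mechanism.

```python
# Hypothetical self-reflection loop: re-plan when recent observations suggest a failure.
from collections import deque

def detect_failure(recent_frames) -> bool:
    """Placeholder check; a real system might ask the MLLM whether the task is still on track."""
    return False

def run_with_replanning(env, reasoner, policy, instruction, window: int = 8, max_steps: int = 300):
    obs = env.reset()
    recent = deque(maxlen=window)                 # short buffer of recent observations
    plan_latent = reasoner.plan(obs, instruction)
    for _ in range(max_steps):
        action = policy.act(obs, plan_latent)
        obs, done = env.step(action)
        recent.append(obs)
        if done:
            break
        if detect_failure(recent):                # e.g., a dropped object is spotted
            plan_latent = reasoner.plan(obs, instruction)   # revise the plan and keep going
    return obs
```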

    Ablation Studies and Model Analysis

    • Reward Ablations: Both goal and trajectory rewards are essential for structured planning and generalization. Removing either significantly degrades performance, and relying only on QA-style rewards limits multi-step reasoning capability.
    • Reduction in Update Frequency: ThinkAct achieves a balance between reasoning (slow, planning) and action (fast, control), allowing robust performance without excessive computational demand.
    • Smaller Models: The approach generalizes to smaller MLLM backbones, maintaining strong reasoning and action capabilities.

    Implementation Details

    • Main backbone: Qwen2.5-VL 7B MLLM.
    • Datasets: Diverse robot and human demonstration videos (Open X-Embodiment, Something-Something V2), plus multimodal QA sets (RoboVQA, EgoPlan-Bench, Video-R1-CoT, etc.).
    • Uses a vision encoder (DINOv2), a text encoder (CLIP), and a Q-Former for connecting reasoning output to action policy input (a rough sketch of such a connector follows this list).
    • Extensive experiments on real and simulated settings confirm scalability and robustness.
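    The role of that Q-Former can be pictured as a small set of learned queries cross-attending over the reasoning module's output tokens to produce a fixed number of conditioning tokens for the action policy. The sketch below captures that pattern; the dimensions, query count, and single attention layer are assumptions for illustration, not the model's actual configuration.

```python
# Illustrative Q-Former-style connector: learned queries cross-attend over the plan-latent
# token sequence and emit a fixed number of conditioning tokens (all dimensions assumed).
import torch
import torch.nn as nn

class PlanConnector(nn.Module):
    def __init__(self, latent_dim: int = 512, num_queries: int = 16, num_heads: int = 8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, latent_dim))
        self.cross_attn = nn.MultiheadAttention(latent_dim, num_heads, batch_first=True)
        self.proj = nn.Linear(latent_dim, latent_dim)  # map to the action policy's input width

    def forward(self, plan_tokens: torch.Tensor) -> torch.Tensor:
        """plan_tokens: (batch, seq_len, latent_dim); returns (batch, num_queries, latent_dim)."""
        q = self.queries.unsqueeze(0).expand(plan_tokens.size(0), -1, -1)
        attended, _ = self.cross_attn(q, plan_tokens, plan_tokens)
        return self.proj(attended)

# Usage: PlanConnector()(torch.randn(2, 64, 512)) -> tensor of shape (2, 16, 512)
```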

    Conclusion

    NVIDIA’s ThinkAct sets a new standard for embodied AI agents, proving that reinforced visual latent planning—where agents “think before they act”—delivers robust, scalable, and adaptive performance in complex, real-world reasoning and robot manipulation tasks. Its dual-system design, reward shaping, and strong empirical results pave the way for intelligent, generalist robots capable of long-horizon planning, few-shot adaptation, and self-correction in diverse environments.


    Check out the Paper and Project for more details. All credit for this research goes to the researchers of this project.
