    Genie Envisioner: A Unified Video-Generative Platform for Scalable, Instruction-Driven Robotic Manipulation

    August 11, 2025

    Embodied AI agents that can perceive, think, and act in the real world mark a key step toward the future of robotics. A central challenge is building scalable, reliable robotic manipulation, the skill of deliberately interacting with and controlling objects through selective contact. While progress spans analytic methods, model-based approaches, and large-scale data-driven learning, most systems still operate in disjoint stages of data collection, training, and evaluation. These stages often require custom setups, manual curation, and task-specific tweaks, creating friction that slows progress, hides failure patterns, and hampers reproducibility. This highlights the need for a unified framework to streamline learning and assessment. 

Robotic manipulation research has progressed from analytical models to neural world models that learn dynamics directly from sensory inputs, in both pixel and latent spaces. Large-scale video generation models can produce realistic visuals but often lack the action conditioning, long-term temporal consistency, and multi-view reasoning needed for control. Vision-language-action models follow instructions but are limited by imitation-based learning, which prevents error recovery and planning. Policy evaluation remains challenging: physics simulators require heavy tuning, and real-world testing is resource-intensive. Existing evaluation metrics often emphasize visual quality over task success, underscoring the need for benchmarks that better capture real-world manipulation performance.

    The Genie Envisioner (GE), developed by researchers from AgiBot Genie Team, NUS LV-Lab, and BUAA, is a unified platform for robotic manipulation that combines policy learning, simulation, and evaluation in a video-generative framework. Its core, GE-Base, is a large-scale, instruction-driven video diffusion model capturing spatial, temporal, and semantic dynamics of real-world tasks. GE-Act maps these representations to precise action trajectories, while GE-Sim offers fast, action-conditioned video-based simulation. The EWMBench benchmark evaluates visual realism, physical accuracy, and instruction-action alignment. Trained on over a million episodes, GE generalizes across robots and tasks, enabling scalable, memory-aware, and physically grounded embodied intelligence research. 
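The closed-loop idea behind GE-Sim can be sketched in a few lines: a policy is rolled out against an action-conditioned world model instead of real hardware. This is a minimal illustrative stub, not the released GE code; the names `WorldModelSim` and `rollout`, and the scalar dynamics, are assumptions made for clarity.

```python
# Hedged sketch: evaluating a policy against an action-conditioned world model
# (the GE-Sim role) rather than a physical robot. All names are illustrative.

class WorldModelSim:
    """Stand-in for an action-conditioned world model."""

    def __init__(self, state=0.0):
        # In GE-Sim the "observation" would be generated video frames;
        # here it is a single scalar so the loop structure stays visible.
        self.state = state

    def step(self, action):
        # Real system: predict the next observation conditioned on the action.
        self.state += action
        return self.state


def rollout(policy, sim, horizon):
    """Roll a policy out against the simulator and record observations."""
    obs = sim.state
    trajectory = []
    for _ in range(horizon):
        action = policy(obs)    # policy proposes an action from the observation
        obs = sim.step(action)  # world model predicts the next observation
        trajectory.append(obs)
    return trajectory
```

Because the "hardware" is a generative model, many such rollouts can run in parallel, which is what makes closed-loop policy testing scalable.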

GE’s design comprises three core components and an evaluation suite. GE-Base is a multi-view, instruction-conditioned video diffusion model trained on over 1 million robotic manipulation episodes; it learns latent trajectories that capture how scenes evolve under given commands. Building on that, GE-Act translates these latent video representations into real action signals via a lightweight flow-matching decoder, offering quick, precise motor control even on robots not in the training data. GE-Sim repurposes GE-Base’s generative power into an action-conditioned neural simulator, enabling closed-loop, video-based rollouts at speeds far beyond real hardware. The EWMBench suite then evaluates the system holistically across video realism, physical consistency, and alignment between instructions and resulting actions.
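The flow-matching decoding step can be sketched as follows. A trained velocity field v(x, t, cond) is integrated from t = 0 to t = 1 to turn a conditioning signal into an action chunk; here the learned network is replaced by the known conditional velocity (target − x) / (1 − t), which such a model approximates. The names `latent_plan`, `velocity`, and `decode_actions` are illustrative assumptions, not the paper's API.

```python
# Hedged sketch of flow-matching action decoding in the spirit of GE-Act.
# The learned velocity field is replaced by its analytic target for illustration.

def velocity(x, t, latent_plan):
    # Conditional velocity pointing at the target actions; a trained
    # network v_theta(x, t, cond) would approximate this quantity.
    return [(p - xi) / (1.0 - t) for p, xi in zip(latent_plan, x)]


def decode_actions(latent_plan, n_steps=10):
    """Euler-integrate the flow ODE from zeros toward an action chunk."""
    x = [0.0] * len(latent_plan)
    dt = 1.0 / n_steps
    for k in range(n_steps):
        t = k * dt
        v = velocity(x, t, latent_plan)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x
```

Because sampling is a handful of ODE steps rather than a long diffusion chain, this style of decoder can emit full action trajectories in milliseconds, consistent with the low-latency control the paper reports.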

    In evaluations, Genie Envisioner showed strong real-world and simulated performance across varied robotic manipulation tasks. GE-Act achieved rapid control generation (54-step trajectories in 200 ms) and consistently outperformed leading vision-language-action baselines in both step-wise and end-to-end success rates. It adapted to new robot types, like Agilex Cobot Magic and Dual Franka, with only an hour of task-specific data, excelling in complex deformable object tasks. GE-Sim delivered high-fidelity, action-conditioned video simulations for scalable, closed-loop policy testing. The EWMBench benchmark confirmed GE-Base’s superior temporal alignment, motion consistency, and scene stability over state-of-the-art video models, aligning closely with human quality judgments. 

    In conclusion, Genie Envisioner is a unified, scalable platform for dual-arm robotic manipulation that merges policy learning, simulation, and evaluation into one video-generative framework. Its core, GE-Base, is an instruction-guided video diffusion model capturing the spatial, temporal, and semantic patterns of real-world robot interactions. GE-Act builds on this by converting these representations into precise, adaptable action plans, even on new robot types with minimal retraining. GE-Sim offers high-fidelity, action-conditioned simulation for closed-loop policy refinement, while EWMBench provides rigorous evaluation of realism, alignment, and consistency. Extensive real-world tests highlight the system’s superior performance, making it a strong foundation for general-purpose, instruction-driven embodied intelligence. 


Check out the Paper and GitHub Page.

    The post Genie Envisioner: A Unified Video-Generative Platform for Scalable, Instruction-Driven Robotic Manipulation appeared first on MarkTechPost.

