
    Meta AI Releases V-JEPA 2: Open-Source Self-Supervised World Models for Understanding, Prediction, and Planning

    June 12, 2025

    Meta AI has introduced V-JEPA 2, a scalable open-source world model designed to learn from video at internet scale and enable robust visual understanding, future state prediction, and zero-shot planning. Building upon the joint-embedding predictive architecture (JEPA), V-JEPA 2 demonstrates how self-supervised learning from passive internet video, combined with minimal robot interaction data, can yield a modular foundation for intelligent physical agents.

    Scalable Self-Supervised Pretraining from 1M Hours of Video

    V-JEPA 2 is pretrained on over 1 million hours of internet-scale video combined with 1 million images. Using a visual mask denoising objective, the model learns to reconstruct masked spatiotemporal patches in a latent representation space. This approach avoids the inefficiencies of pixel-level prediction by focusing on predictable scene dynamics while disregarding irrelevant noise.
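In code, the core training step looks roughly like the following PyTorch sketch. This is a simplified illustration of the masked latent prediction idea, not Meta's implementation: the dimensions, module shapes, and the crude zeroing-based masking are all assumptions made for brevity.

```python
import torch
import torch.nn as nn

D_IN, D = 768, 1024  # hypothetical patch-token and latent dimensions

# Context encoder, target encoder (an EMA copy in practice), and predictor.
encoder = nn.Sequential(nn.Linear(D_IN, D), nn.GELU(), nn.Linear(D, D))
target_encoder = nn.Sequential(nn.Linear(D_IN, D), nn.GELU(), nn.Linear(D, D))
target_encoder.load_state_dict(encoder.state_dict())
for p in target_encoder.parameters():
    p.requires_grad_(False)
predictor = nn.Sequential(nn.Linear(D, D), nn.GELU(), nn.Linear(D, D))

def masked_latent_loss(patches, mask):
    """patches: (B, N, D_IN) spatiotemporal patch tokens; mask: (B, N) bool, True = masked."""
    with torch.no_grad():
        targets = target_encoder(patches)               # latent targets for all tokens
    context = encoder(patches * (~mask).unsqueeze(-1))  # hide masked tokens from the context
    preds = predictor(context)                          # regress latents for every token
    return (preds[mask] - targets[mask]).abs().mean()   # loss only on masked positions

loss = masked_latent_loss(torch.randn(2, 196, D_IN), torch.rand(2, 196) < 0.75)
loss.backward()
```

The key point the sketch captures is that the loss lives in representation space: the model never reconstructs pixels, only the latents of the masked patches.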

    To scale JEPA pretraining to this level, Meta researchers introduced four key techniques:

    • Data scaling: Constructed a 22M-sample dataset (VideoMix22M) from public sources like SSv2, Kinetics, HowTo100M, YT-Temporal-1B, and ImageNet.
    • Model scaling: Expanded the encoder capacity to over 1B parameters using ViT-g.
    • Training schedule: Adopted a progressive resolution strategy and extended pretraining to 252K iterations.
• Spatial-temporal augmentation: Trained on progressively longer and higher-resolution clips, reaching 64 frames at 384×384 resolution (a toy version of this schedule is sketched below).
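To make the last two points concrete, here is a toy schedule function. Only the 252K-iteration budget and the final 64-frame, 384×384 stage come from the list above; the intermediate stages are invented purely for illustration.

```python
def clip_shape(step, total=252_000):
    """Return (frames, resolution) for a training step under a hypothetical
    progressive spatiotemporal schedule; stage boundaries are made up."""
    stages = [            # (fraction of training, frames, resolution)
        (0.5, 16, 256),
        (0.8, 32, 320),
        (1.0, 64, 384),   # final stage from the text above
    ]
    for frac, frames, res in stages:
        if step <= frac * total:
            return frames, res
    return 64, 384
```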

These design choices led to an 88.2% average accuracy across six benchmark tasks (SSv2, Diving-48, Jester, Kinetics, COIN, and ImageNet), surpassing previous baselines.

    Understanding via Masked Representation Learning

    V-JEPA 2 exhibits strong motion understanding capabilities. On the Something-Something v2 benchmark, it achieves 77.3% top-1 accuracy, outperforming models like InternVideo and VideoMAEv2. For appearance understanding, it remains competitive with state-of-the-art image-text pretraining models like DINOv2 and PEcoreG.

    The encoder’s representations were evaluated using attentive probes, verifying that self-supervised learning alone can yield transferable and domain-agnostic visual features applicable across diverse classification tasks.
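One plausible form of such a probe is a single learned query that cross-attends over the frozen encoder's output tokens, followed by a linear head. The sketch below assumes this structure; the exact probe design is an assumption, not a detail taken from the release.

```python
import torch
import torch.nn as nn

class AttentiveProbe(nn.Module):
    """Lightweight classifier trained on top of a frozen encoder."""
    def __init__(self, dim=1024, num_classes=174, heads=8):  # 174 classes, as in SSv2
        super().__init__()
        self.query = nn.Parameter(torch.randn(1, 1, dim))
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, tokens):                       # tokens: (B, N, dim) frozen features
        q = self.query.expand(tokens.size(0), -1, -1)
        pooled, _ = self.attn(q, tokens, tokens)     # learned query attends over all tokens
        return self.head(pooled.squeeze(1))          # (B, num_classes)

# Only the probe's parameters are trained; the V-JEPA 2 encoder stays frozen.
probe = AttentiveProbe()
logits = probe(torch.randn(4, 196, 1024))
```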

    Temporal Reasoning via Video Question Answering

    To assess temporal reasoning, the V-JEPA 2 encoder is aligned with a multimodal large language model and evaluated on multiple video question-answering tasks. Despite lacking language supervision during pretraining, the model achieves:

    • 84.0% on PerceptionTest
    • 76.9% on TempCompass
    • 44.5% on MVP
    • 36.7% on TemporalBench
    • 40.3% on TOMATO

    These results challenge the assumption that visual-language alignment requires co-training from the start, demonstrating that a pretrained video encoder can be aligned post hoc with strong generalization.

    V-JEPA 2-AC: Learning Latent World Models for Robotic Planning

A key innovation in this release is V-JEPA 2-AC, an action-conditioned variant of the pretrained encoder. Fine-tuned using only 62 hours of unlabeled robot video from the DROID dataset, V-JEPA 2-AC learns to predict future video embeddings conditioned on robot actions and poses. The architecture is a 300M-parameter transformer with block-causal attention, trained with a teacher-forcing and rollout objective.
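Schematically, such a predictor adds projected actions to the frame embeddings and applies attention that cannot look into the future. The sketch below simplifies block-causal attention to one token per timestep (real block-causal attention operates over blocks of patch tokens per frame); sizes and structure are assumptions, not Meta's code.

```python
import torch
import torch.nn as nn

class ActionConditionedPredictor(nn.Module):
    """Simplified stand-in for an action-conditioned latent world model."""
    def __init__(self, dim=1024, action_dim=7, layers=4, heads=8):
        super().__init__()
        self.action_proj = nn.Linear(action_dim, dim)
        block = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.blocks = nn.TransformerEncoder(block, layers)
        self.out = nn.Linear(dim, dim)

    def forward(self, states, actions):
        # states: (B, T, dim) frame embeddings; actions: (B, T, action_dim)
        x = states + self.action_proj(actions)  # condition each step on its action
        T = x.size(1)
        causal = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
        return self.out(self.blocks(x, mask=causal))  # predicted next-step embeddings
```

Under teacher forcing the model is trained on ground-truth embedding sequences; at rollout time it feeds its own predictions back in to imagine multiple steps ahead.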

This enables zero-shot planning through model-predictive control. The model infers action sequences by minimizing the distance between imagined future states and visual goals using the Cross-Entropy Method (CEM). It achieves high success rates on tasks such as reaching, grasping, and pick-and-place with unseen robot arms in different labs, without any reward supervision or additional data collection.
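A compact version of this planning loop might look like the sketch below, where rollout_fn is a hypothetical placeholder that rolls an action-conditioned predictor (like the one sketched above) forward from the current state and returns the imagined final embeddings.

```python
import torch

def cem_plan(rollout_fn, goal_emb, horizon=10, action_dim=7,
             samples=256, elites=32, iters=5):
    """rollout_fn: (samples, horizon, action_dim) actions -> (samples, dim)
    imagined final-state embeddings; goal_emb: (dim,) encoded goal image."""
    mean = torch.zeros(horizon, action_dim)
    std = torch.ones(horizon, action_dim)
    for _ in range(iters):
        # Sample candidate action sequences from the current Gaussian.
        acts = mean + std * torch.randn(samples, horizon, action_dim)
        with torch.no_grad():
            final = rollout_fn(acts)                # imagined end states
        cost = (final - goal_emb).norm(dim=-1)      # latent distance to the goal
        elite = acts[cost.topk(elites, largest=False).indices]
        mean, std = elite.mean(0), elite.std(0)     # refit the Gaussian to the elites
    return mean[0]  # execute the first action, then replan (receding horizon)
```

Executing only the first action and replanning after each new observation is what makes this model-predictive control rather than open-loop execution.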

    Benchmarks: Robust Performance and Planning Efficiency

    Compared to baselines like Octo (behavior cloning) and Cosmos (latent diffusion world models), V-JEPA 2-AC:

    • Executes plans in ~16 seconds per step (versus 4 minutes for Cosmos).
    • Reaches a 100% success rate on reach tasks.
    • Outperforms others in grasp and manipulation tasks across object types.

    Notably, it operates using a monocular RGB camera without calibration or environment-specific fine-tuning, reinforcing the generalization capability of the learned world model.

    Conclusion

    Meta’s V-JEPA 2 represents a significant advancement in scalable self-supervised learning for physical intelligence. By decoupling observation learning from action conditioning and leveraging large-scale passive video, V-JEPA 2 demonstrates that general-purpose visual representations can be harnessed for both perception and control in the real world.


Check out the Paper and the Models on Hugging Face and the GitHub Page. All credit for this research goes to the researchers of this project.

Source: MarkTechPost
