    Hugging Face Releases SmolVLA: A Compact Vision-Language-Action Model for Affordable and Efficient Robotics

    June 3, 2025

    Despite recent progress in robotic control via large-scale vision-language-action (VLA) models, real-world deployment remains constrained by hardware and data requirements. Most VLA models depend on transformer-based backbones with billions of parameters, resulting in significant memory and compute costs. This limits experimentation to well-resourced labs and cloud environments, excluding practitioners working with lower-cost hardware. Additionally, much of the current progress in VLA research is either proprietary or based on non-reproducible methodologies, impeding open research. Finally, data heterogeneity across robotic platforms (differences in morphology, sensors, and control modes) poses a further challenge to generalizability and cross-platform learning.

    Hugging Face Introduces SmolVLA: A Lightweight, Open VLA Framework

    Hugging Face presents SmolVLA, a compact vision-language-action model developed for affordability and deployment efficiency. Unlike conventional VLAs, SmolVLA is trained entirely on community-collected datasets and is optimized to run on single-GPU or CPU environments. The model architecture integrates a trimmed version of a pretrained vision-language model (SmolVLM-2) and a transformer-based action expert. This structure enables efficient low-level control from natural language instructions and RGB camera inputs.

    A distinguishing feature of SmolVLA is its asynchronous inference stack, which decouples action prediction from execution. This design enables low-latency control suitable for real-time applications, even in resource-constrained settings. SmolVLA is released under an open license with accompanying code, training data, and deployment tools.

    Architectural Overview and Design Trade-Offs

    The SmolVLA model is structured into two primary components:

    • Perception Module (SmolVLM-2): A pretrained compact vision-language encoder processes sequences of RGB images, sensorimotor states, and language instructions. For efficiency, the model limits visual tokens through downsampling and only uses the lower half of transformer layers, based on empirical findings that earlier layers often yield more transferable features.
    • Action Expert: A lightweight transformer, trained with flow matching, predicts sequences of continuous control actions. The action expert alternates between self-attention and cross-attention layers, balancing internal action coherence and conditioning on perception inputs. Causal masking is applied to enforce temporal consistency.
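
    The flow-matching objective mentioned for the action expert can be illustrated in plain Python. This is a minimal sketch, assuming a straight-line path between a noise sample and the target action and a fixed-step Euler integrator. The function names, the step count, the 3-dimensional action, and the "oracle" velocity field that stands in for the trained network are all hypothetical and are not taken from SmolVLA's code.

    ```python
    import random

    def interpolate(noise, action, t):
        """Point on the straight path from noise (t=0) to action (t=1)."""
        return [(1.0 - t) * n + t * a for n, a in zip(noise, action)]

    def target_velocity(noise, action):
        """Constant velocity along the straight path; the regression target."""
        return [a - n for n, a in zip(noise, action)]

    def flow_matching_loss(predict_velocity, noise, action, t):
        """Mean squared error between predicted and target velocity at time t."""
        x_t = interpolate(noise, action, t)
        pred = predict_velocity(x_t, t)
        target = target_velocity(noise, action)
        return sum((p - u) ** 2 for p, u in zip(pred, target)) / len(target)

    def euler_sample(predict_velocity, noise, num_steps=10):
        """Integrate dx/dt = v(x, t) from t=0 to t=1 with fixed Euler steps."""
        x = list(noise)
        dt = 1.0 / num_steps
        for step in range(num_steps):
            t = step * dt
            v = predict_velocity(x, t)
            x = [xi + dt * vi for xi, vi in zip(x, v)]
        return x

    # Toy check with a hypothetical 3-dimensional action.
    target_action = [0.5, -0.2, 0.1]

    def oracle_velocity(x, t):
        # Stand-in for a trained network: on the straight path,
        # (action - x_t) / (1 - t) equals action - noise.
        return [(a - xi) / (1.0 - t) for a, xi in zip(target_action, x)]

    rng = random.Random(0)
    noise = [rng.gauss(0.0, 1.0) for _ in target_action]
    sample = euler_sample(oracle_velocity, noise, num_steps=10)
    loss = flow_matching_loss(oracle_velocity, noise, target_action, t=0.3)
    ```

    In a real model the velocity predictor would be the action expert conditioned on perception features, and the loss would be averaged over sampled noise, time values, and whole action chunks rather than a single vector.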

    To reduce computational overhead, linear projections are used to align the modalities’ token dimensions. Action chunks are generated instead of single-step predictions, reducing the frequency of inference calls. The model is trained using bfloat16 precision and PyTorch’s JIT compilation for runtime optimization.
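
    The effect of chunked prediction on inference frequency can be sketched with a toy control loop. This is an illustration only: `run_episode`, `dummy_policy`, the step count, and the chunk size of 50 are hypothetical choices, and the policy returns placeholder actions rather than real robot commands.

    ```python
    from collections import deque

    def run_episode(policy, num_steps, chunk_size):
        """Run a control loop that calls the policy only when the queue is empty."""
        queue = deque()
        inference_calls = 0
        executed = []
        for step in range(num_steps):
            if not queue:
                observation = step  # placeholder for camera images and robot state
                queue.extend(policy(observation, chunk_size))
                inference_calls += 1
            executed.append(queue.popleft())
        return executed, inference_calls

    def dummy_policy(observation, chunk_size):
        """Hypothetical policy: returns chunk_size placeholder actions."""
        return [(observation, i) for i in range(chunk_size)]

    # Same number of control steps, very different number of model calls.
    _, calls_single = run_episode(dummy_policy, num_steps=100, chunk_size=1)
    _, calls_chunked = run_episode(dummy_policy, num_steps=100, chunk_size=50)
    ```

    With these toy settings the single-step loop calls the policy once per control step, while the chunked loop calls it once per 50 steps. The trade-off is that actions late in a chunk are based on an older observation.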

    Empirical Evaluation: Simulation and Real-World Performance

    SmolVLA is evaluated on both simulation benchmarks (LIBERO and Meta-World) and real-world robotic tasks using the low-cost SO100 and SO101 platforms. Building on the pretrained SmolVLM-2 backbone, the model is trained on ~23K episodes across 481 community datasets, with task labels auto-generated using a VLM. Evaluation metrics include task-level success rates under both in-distribution and out-of-distribution conditions.

    In the LIBERO benchmark, SmolVLA (0.45B) achieves an average success rate of 87.3%, closely matching or surpassing larger models such as π₀ (3.3B). In Meta-World, the model outperforms diffusion policies and smaller-scale VLAs across task difficulty levels. These results are notable considering SmolVLA’s smaller training footprint and absence of robotics-specific pretraining.

    In real-world settings, SmolVLA achieves an average success rate of 78.3% across pick-place, stacking, and sorting tasks, outperforming both ACT (trained from scratch) and π₀ (finetuned). Moreover, SmolVLA generalizes across robotic embodiments, maintaining performance on SO101 despite training exclusively on SO100 data.

    Performance Implications of Asynchronous Inference

    SmolVLA’s asynchronous inference stack improves control efficiency by overlapping prediction and execution. Compared to traditional synchronous inference, this approach reduces average task time by ~30% and doubles the number of completed actions in fixed-time scenarios. This is particularly beneficial for edge deployments where inference delays degrade real-time performance.
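
    The overlap of prediction and execution can be illustrated with a small deterministic simulation. This is a sketch under stated assumptions, not SmolVLA's inference stack: time is counted in abstract ticks, one action takes one tick, a new chunk is simply appended to the queue when it arrives, and the function name and all parameter values are hypothetical. The numbers it produces describe only this toy model and are not a reproduction of the reported results.

    ```python
    def simulate_ticks(num_actions, chunk_size, inference_ticks, trigger_threshold):
        """Count the ticks needed to execute num_actions.

        A new inference request is issued whenever none is in flight and the
        queue holds at most trigger_threshold actions. A threshold of 0 models
        synchronous control (the robot idles while it waits); a positive
        threshold lets inference run while queued actions are still executing.
        """
        time = 0
        queued = 0
        executed = 0
        ready_at = None  # tick at which the in-flight chunk arrives
        while executed < num_actions:
            if ready_at is not None and time >= ready_at:
                queued += chunk_size  # chunk arrives and joins the queue
                ready_at = None
            if ready_at is None and queued <= trigger_threshold:
                ready_at = time + inference_ticks  # start the next inference
            if queued > 0:
                queued -= 1  # execute one action this tick
                executed += 1
            time += 1
        return time

    sync_ticks = simulate_ticks(100, chunk_size=10, inference_ticks=4,
                                trigger_threshold=0)
    async_ticks = simulate_ticks(100, chunk_size=10, inference_ticks=4,
                                 trigger_threshold=4)
    ```

    In this toy model the synchronous loop pays the inference delay once per chunk, while the asynchronous loop pays it only before the first chunk, provided the threshold leaves enough queued actions to cover the inference time.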

    Conclusion

    SmolVLA demonstrates that compact, reproducible, and open-source VLA models can support competent robotic control on low-cost hardware. Through careful architectural choices—layer pruning, chunked action prediction, and asynchronous execution—SmolVLA maintains performance while significantly reducing computational demands.

    The model’s open training and deployment stack, paired with real-world evaluations, offers a practical foundation for further research in efficient and accessible robot learning. Future directions include expanding cross-embodiment datasets, scaling model capacity without sacrificing latency, and exploring joint training on multimodal corpora beyond robotics data.


    Check out the Paper and Model on Hugging Face. All credit for this research goes to the researchers of this project.

    The post Hugging Face Releases SmolVLA: A Compact Vision-Language-Action Model for Affordable and Efficient Robotics appeared first on MarkTechPost.
