SmolVLA: A Vision-Language-Action Model for Affordable and Efficient Robotics
June 2, 2025
作者: Mustafa Shukor, Dana Aubakirova, Francesco Capuano, Pepijn Kooijmans, Steven Palma, Adil Zouitine, Michel Aractingi, Caroline Pascal, Martino Russi, Andres Marafioti, Simon Alibert, Matthieu Cord, Thomas Wolf, Remi Cadene
cs.AI
Abstract
Vision-language models (VLMs) pretrained on large-scale multimodal datasets
encode rich visual and linguistic knowledge, making them a strong foundation
for robotics. Rather than training robotic policies from scratch, recent
approaches adapt VLMs into vision-language-action (VLA) models that enable
natural language-driven perception and control. However, existing VLAs are
typically massive--often with billions of parameters--leading to high training
costs and limited real-world deployability. Moreover, they rely on academic and
industrial datasets, overlooking the growing availability of
community-collected data from affordable robotic platforms. In this work, we
present SmolVLA, a small, efficient, and community-driven VLA that drastically
reduces both training and inference costs, while retaining competitive
performance. SmolVLA is designed to be trained on a single GPU and deployed on
consumer-grade GPUs or even CPUs. To further improve responsiveness, we
introduce an asynchronous inference stack that decouples perception and action
prediction from action execution, allowing higher control rates with chunked
action generation. Despite its compact size, SmolVLA achieves performance
comparable to VLAs that are 10x larger. We evaluate SmolVLA on a range of
simulated and real-world robotic benchmarks and release all code,
pretrained models, and training data.
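
To make the asynchronous inference idea concrete, below is a minimal Python sketch of decoupling chunked action prediction from action execution: a background thread produces chunks of actions while the control loop keeps executing at a fixed rate. The names (DummyPolicy, predict_chunk), chunk size, refill threshold, and control frequency are illustrative assumptions, not the paper's actual implementation.

```python
# Minimal sketch of asynchronous, chunked inference (illustrative only).
# DummyPolicy and all constants below are hypothetical stand-ins, not SmolVLA code.
import threading
import time
from collections import deque

CHUNK_SIZE = 10          # actions predicted per inference call (assumed)
REFILL_THRESHOLD = 3     # request a new chunk when the buffer gets this low
CONTROL_HZ = 30          # execution rate of the control loop


class DummyPolicy:
    """Stand-in for a VLA policy that returns a chunk of actions."""

    def predict_chunk(self, observation):
        time.sleep(0.1)  # simulate slow model inference
        return [f"action_{observation}_{i}" for i in range(CHUNK_SIZE)]


class AsyncInference:
    """Runs inference in a background thread so execution never blocks on it."""

    def __init__(self, policy):
        self.policy = policy
        self.queue = deque()
        self.lock = threading.Lock()
        self.busy = False

    def maybe_refill(self, observation):
        # Start a new prediction only if the buffer is low and no prediction
        # is already in flight.
        with self.lock:
            if self.busy or len(self.queue) > REFILL_THRESHOLD:
                return
            self.busy = True
        threading.Thread(target=self._infer, args=(observation,), daemon=True).start()

    def _infer(self, observation):
        chunk = self.policy.predict_chunk(observation)
        with self.lock:
            self.queue.extend(chunk)
            self.busy = False

    def next_action(self):
        with self.lock:
            return self.queue.popleft() if self.queue else None


def control_loop(steps=60):
    runner = AsyncInference(DummyPolicy())
    for t in range(steps):
        observation = t                    # placeholder for camera/state input
        runner.maybe_refill(observation)   # prediction overlaps with execution
        action = runner.next_action()
        if action is not None:
            pass                           # send `action` to the robot here
        time.sleep(1.0 / CONTROL_HZ)


if __name__ == "__main__":
    control_loop()
```

Because prediction overlaps with execution, the control loop is not stalled by model latency; the main design choice is the refill threshold, which trades off how stale the observation behind the current chunk is allowed to become.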