SmolVLA: A Vision-Language-Action Model for Affordable and Efficient Robotics

June 2, 2025
Authors: Mustafa Shukor, Dana Aubakirova, Francesco Capuano, Pepijn Kooijmans, Steven Palma, Adil Zouitine, Michel Aractingi, Caroline Pascal, Martino Russi, Andres Marafioti, Simon Alibert, Matthieu Cord, Thomas Wolf, Remi Cadene
cs.AI

Abstract

Vision-language models (VLMs) pretrained on large-scale multimodal datasets encode rich visual and linguistic knowledge, making them a strong foundation for robotics. Rather than training robotic policies from scratch, recent approaches adapt VLMs into vision-language-action (VLA) models that enable natural language-driven perception and control. However, existing VLAs are typically massive--often with billions of parameters--leading to high training costs and limited real-world deployability. Moreover, they rely on academic and industrial datasets, overlooking the growing availability of community-collected data from affordable robotic platforms. In this work, we present SmolVLA, a small, efficient, and community-driven VLA that drastically reduces both training and inference costs, while retaining competitive performance. SmolVLA is designed to be trained on a single GPU and deployed on consumer-grade GPUs or even CPUs. To further improve responsiveness, we introduce an asynchronous inference stack decoupling perception and action prediction from action execution, allowing higher control rates with chunked action generation. Despite its compact size, SmolVLA achieves performance comparable to VLAs that are 10x larger. We evaluate SmolVLA on a range of both simulated as well as real-world robotic benchmarks and release all code, pretrained models, and training data.
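The asynchronous inference stack mentioned above decouples perception and action prediction from action execution. The sketch below illustrates that idea in Python under assumed interfaces: `policy.predict_chunk(obs)` (returns a fixed-size list of actions), `robot.get_observation()`, and `robot.apply(action)` are hypothetical names, and the chunk size and control rate are illustrative. This is not SmolVLA's actual API, only a minimal rendering of the decoupling described in the abstract.

```python
# Minimal sketch of asynchronous inference with chunked action generation.
# `policy` and `robot` are hypothetical interfaces, not SmolVLA's real API.

import queue
import threading
import time

CHUNK_SIZE = 50   # actions returned per inference call (assumed)
CONTROL_HZ = 30   # control loop frequency (assumed)


def inference_worker(policy, obs_queue, chunk_queue):
    """Runs model inference off the control thread."""
    while True:
        obs = obs_queue.get()
        if obs is None:                      # shutdown signal
            break
        chunk_queue.put(policy.predict_chunk(obs))  # CHUNK_SIZE actions


def control_loop(robot, policy, num_steps=1000):
    obs_queue = queue.Queue(maxsize=1)
    chunk_queue = queue.Queue(maxsize=1)
    worker = threading.Thread(
        target=inference_worker, args=(policy, obs_queue, chunk_queue), daemon=True
    )
    worker.start()

    # Prime the pipeline with the first chunk (the only blocking inference).
    obs_queue.put(robot.get_observation())
    current_chunk = list(chunk_queue.get())

    for _ in range(num_steps):
        # Request the next chunk before the current one runs out, so
        # prediction overlaps with execution instead of blocking it.
        if len(current_chunk) == CHUNK_SIZE // 2 and obs_queue.empty():
            obs_queue.put(robot.get_observation())

        # Swap in the freshly predicted chunk as soon as it is ready,
        # discarding any stale tail of the old one.
        if not current_chunk or not chunk_queue.empty():
            current_chunk = list(chunk_queue.get())

        robot.apply(current_chunk.pop(0))
        time.sleep(1.0 / CONTROL_HZ)

    obs_queue.put(None)   # stop the worker
```

Because the next chunk is requested while the current one is still executing, the control loop waits on model latency only at startup, which is what allows higher control rates with chunked action generation.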
