SmolVLA: A Vision-Language-Action Model for Affordable and Efficient Robotics

Abstract

Vision-language models (VLMs) pretrained on large-scale multimodal datasets encode rich visual and linguistic knowledge, making them a strong foundation for robotics. Rather than training robotic policies from scratch, recent approaches adapt VLMs into vision-language-action (VLA) models that enable natural-language-driven perception and control. However, existing VLAs are typically massive, often with billions of parameters, which leads to high training costs and limited real-world deployability. Moreover, they rely on academic and industrial datasets, overlooking the growing availability of community-collected data from affordable robotic platforms. In this work, we present SmolVLA, a small, efficient, and community-driven VLA that drastically reduces both training and inference costs while retaining competitive performance. SmolVLA is designed to be trained on a single GPU and deployed on consumer-grade GPUs or even CPUs. To further improve responsiveness, we introduce an asynchronous inference stack that decouples perception and action prediction from action execution, allowing higher control rates with chunked action generation. Despite its compact size, SmolVLA achieves performance comparable to VLAs that are 10x larger. We evaluate SmolVLA on a range of simulated and real-world robotic benchmarks and release all code, pretrained models, and training data.
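The idea of decoupling action prediction from action execution can be illustrated with a minimal sketch. This is not the SmolVLA inference stack: the function `run_async_control` and its parameters (`get_observation`, `predict_chunk`, `execute`, `queue_size`) are hypothetical names chosen for illustration, and the policy is stood in for by any callable that returns a chunk of actions. Under those assumptions, a background thread predicts action chunks into a bounded buffer while the control loop only pops ready actions.

```python
import queue
import threading
from typing import Callable, Sequence


def run_async_control(
    get_observation: Callable[[], object],
    predict_chunk: Callable[[object], Sequence[float]],
    execute: Callable[[float], None],
    total_steps: int,
    queue_size: int = 8,
) -> None:
    """Execute `total_steps` actions while a background thread predicts chunks.

    All names here are illustrative assumptions, not an existing API.
    """
    actions: "queue.Queue[float]" = queue.Queue(maxsize=queue_size)
    stop = threading.Event()

    def predictor() -> None:
        # Perception and action prediction run here, off the control loop.
        while not stop.is_set():
            chunk = predict_chunk(get_observation())
            for action in chunk:
                # Block while the buffer is full, waking up to check `stop`.
                while not stop.is_set():
                    try:
                        actions.put(action, timeout=0.05)
                        break
                    except queue.Full:
                        continue
                if stop.is_set():
                    return

    worker = threading.Thread(target=predictor, daemon=True)
    worker.start()
    try:
        for _ in range(total_steps):
            # The control loop never runs the model; it only consumes
            # actions that have already been predicted.
            execute(actions.get())
    finally:
        stop.set()
        worker.join()
```

In this simplified version the predictor reads a new observation whenever it is ready to produce the next chunk, so buffered actions may be based on slightly stale observations. A smaller `queue_size` reduces that staleness at the cost of a higher risk that the control loop has to wait for the next chunk.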
