SimVLA: A Simple VLA Baseline for Robotic Manipulation

Abstract

Vision-Language-Action (VLA) models have emerged as a promising paradigm for general-purpose robotic manipulation, leveraging large-scale pre-training to achieve strong performance. The field has evolved rapidly through added spatial priors and diverse architectural innovations. However, these advances are often accompanied by differing training recipes and implementation details, which makes it difficult to isolate the precise source of empirical gains. In this work, we introduce SimVLA, a streamlined baseline designed to establish a transparent reference point for VLA research. By strictly decoupling perception from control, using a standard vision-language backbone and a lightweight action head, and standardizing critical training dynamics, we demonstrate that a minimal design can achieve state-of-the-art performance. Despite having only 0.5B parameters, SimVLA outperforms multi-billion-parameter models on standard simulation benchmarks without robot pre-training. In real-robot evaluations, SimVLA performs on par with pi0.5. Our results establish SimVLA as a robust, reproducible baseline that enables clear attribution of empirical gains to future architectural innovations. Website: https://frontierrobo.github.io/SimVLA

