

SimVLA: A Simple VLA Baseline for Robotic Manipulation

February 20, 2026
Authors: Yuankai Luo, Woping Chen, Tong Liang, Baiqiao Wang, Zhenguo Li
cs.AI

Abstract

Vision-Language-Action (VLA) models have emerged as a promising paradigm for general-purpose robotic manipulation, leveraging large-scale pre-training to achieve strong performance. The field has evolved rapidly with additional spatial priors and diverse architectural innovations. However, these advances are often accompanied by varying training recipes and implementation details, which makes it challenging to disentangle the precise source of empirical gains. In this work, we introduce SimVLA, a streamlined baseline designed to establish a transparent reference point for VLA research. By strictly decoupling perception from control, using a standard vision-language backbone with a lightweight action head, and standardizing critical training dynamics, we demonstrate that a minimal design can achieve state-of-the-art performance. Despite having only 0.5B parameters, SimVLA outperforms multi-billion-parameter models on standard simulation benchmarks without robot pretraining, and achieves real-robot performance on par with pi0.5. Our results establish SimVLA as a robust, reproducible baseline that enables clear attribution of empirical gains to future architectural innovations. Website: https://frontierrobo.github.io/SimVLA
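The decoupling the abstract describes can be illustrated with a minimal structural sketch: a vision-language backbone handles perception, and a separate lightweight action head handles control. All class names, method names, and the placeholder computations below are illustrative assumptions for exposition only, not the authors' actual implementation (a real backbone would be a pretrained ~0.5B-parameter VLM, and a real head a small learned module).

```python
# Illustrative sketch of the perception/control decoupling described in the
# abstract. Names and internals are hypothetical, not the SimVLA codebase.
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Observation:
    image: Sequence[float]   # stand-in for camera pixels
    instruction: str         # natural-language task command


class VisionLanguageBackbone:
    """Perception: fuses image and instruction into one feature vector."""

    def encode(self, obs: Observation) -> List[float]:
        # Placeholder fusion; a real backbone would run a pretrained VLM.
        text_feat = float(len(obs.instruction))
        return [pixel + text_feat for pixel in obs.image]


class ActionHead:
    """Control: a lightweight head mapping features to a robot action."""

    def __init__(self, action_dim: int):
        self.action_dim = action_dim

    def predict(self, features: List[float]) -> List[float]:
        # Placeholder; a real head might be a small MLP over the features.
        mean = sum(features) / len(features)
        return [mean] * self.action_dim


class SimVLAPolicy:
    """Perception and control are strictly separated modules."""

    def __init__(self, backbone: VisionLanguageBackbone, head: ActionHead):
        self.backbone = backbone
        self.head = head

    def act(self, obs: Observation) -> List[float]:
        features = self.backbone.encode(obs)   # perception only
        return self.head.predict(features)     # control only


policy = SimVLAPolicy(VisionLanguageBackbone(), ActionHead(action_dim=7))
action = policy.act(Observation(image=[0.1, 0.2, 0.3],
                                instruction="pick up the cup"))
print(len(action))  # 7-dimensional action vector
```

The point of the sketch is the interface, not the math: because the backbone and head communicate only through a feature vector, either module can be swapped or ablated independently, which is what makes attribution of empirical gains tractable.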