VLA-0: Building State-of-the-Art VLAs with Zero Modification
October 15, 2025
Authors: Ankit Goyal, Hugo Hadfield, Xuning Yang, Valts Blukis, Fabio Ramos
cs.AI
Abstract
Vision-Language-Action models (VLAs) hold immense promise for enabling
generalist robot manipulation. However, the best way to build them remains an
open question. Current approaches often add complexity, such as modifying the
existing vocabulary of a Vision-Language Model (VLM) with action tokens or
introducing special action heads. Curiously, the simplest strategy of
representing actions directly as text has remained largely unexplored. This
work introduces VLA-0 to investigate this idea. We find that VLA-0 is not only
effective; it is surprisingly powerful. With the right design, VLA-0
outperforms more involved models. On LIBERO, a popular benchmark for evaluating
VLAs, VLA-0 outperforms all existing methods trained on the same robotic data,
including pi_0.5-KI, OpenVLA-OFT and SmolVLA. Furthermore, without
large-scale robotics-specific training, it outperforms methods trained on
large-scale robotic data, like pi_0.5-KI, pi_0, GR00T-N1 and MolmoAct.
These findings also translate to the real world, where VLA-0 outperforms
SmolVLA, a VLA model pre-trained on large-scale real data. This paper
summarizes our unexpected findings and spells out the specific techniques
required to unlock the high performance of this simple yet potent VLA design.
Visual results, code, and trained models are provided here:
https://vla0.github.io/.
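
Illustrative note: the abstract describes representing actions directly as text, but does not spell out the exact format. Below is a minimal sketch, assuming each dimension of a normalized continuous action is discretized to an integer and emitted as ordinary text that a stock VLM tokenizer already covers; the function names, bin count, and normalization range are assumptions for illustration, not the paper's specification.

import numpy as np

# Minimal illustrative sketch (not the authors' exact recipe): serialize a
# continuous robot action as plain text so an unmodified VLM can predict it
# with its ordinary tokenizer, then parse the generated text back to numbers.

ACTION_DIM = 7          # hypothetical: xyz delta, rpy delta, gripper
NUM_BINS = 1000         # hypothetical discretization resolution
LOW, HIGH = -1.0, 1.0   # assumed action normalization range


def action_to_text(action: np.ndarray) -> str:
    """Encode a normalized action vector as a space-separated integer string."""
    clipped = np.clip(action, LOW, HIGH)
    bins = np.round((clipped - LOW) / (HIGH - LOW) * (NUM_BINS - 1)).astype(int)
    return " ".join(str(b) for b in bins)


def text_to_action(text: str) -> np.ndarray:
    """Decode the VLM's text output back into a continuous action vector."""
    bins = np.array([int(tok) for tok in text.split()[:ACTION_DIM]], dtype=float)
    return bins / (NUM_BINS - 1) * (HIGH - LOW) + LOW


if __name__ == "__main__":
    a = np.array([0.1, -0.2, 0.05, 0.0, 0.0, 0.3, 1.0])
    s = action_to_text(a)
    print(s)                  # text the model would be trained to output
    print(text_to_action(s))  # recovered (quantized) action

Under this kind of scheme, no new action tokens or separate action heads are needed; the VLM simply generates the integer string, which is decoded back into a continuous command.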