Unified Vision-Language-Action Model
June 24, 2025
Authors: Yuqi Wang, Xinghang Li, Wenxuan Wang, Junbo Zhang, Yingyan Li, Yuntao Chen, Xinlong Wang, Zhaoxiang Zhang
cs.AI
Abstract
Vision-language-action models (VLAs) have garnered significant attention for
their potential in advancing robotic manipulation. However, previous approaches
predominantly rely on the general comprehension capabilities of vision-language
models (VLMs) to generate action signals, often overlooking the rich temporal
and causal structure embedded in visual observations. In this paper, we present
UniVLA, a unified and native multimodal VLA model that autoregressively models
vision, language, and action signals as discrete token sequences. This
formulation enables flexible multimodal task learning, particularly from
large-scale video data. By incorporating world modeling during post-training,
UniVLA captures causal dynamics from videos, facilitating effective transfer to
downstream policy learning--especially for long-horizon tasks. Our approach
sets new state-of-the-art results across several widely used simulation
benchmarks, including CALVIN, LIBERO, and SimplerEnv-Bridge, significantly
surpassing previous methods. For example, UniVLA achieves a 95.5% average
success rate on the LIBERO benchmark, surpassing pi0-FAST's 85.5%. We further
demonstrate
its broad applicability on real-world ALOHA manipulation and autonomous
driving.
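
To make the token-sequence formulation described in the abstract concrete, below is a minimal, hypothetical sketch (not the authors' implementation) of autoregressive modeling over vision, language, and action signals discretized into one shared vocabulary. The codebook sizes, model dimensions, and helper names (to_shared_vocab, TinyUnifiedAR) are illustrative assumptions.

```python
# Minimal sketch: vision, language, and action tokens share one vocabulary and
# are modeled autoregressively with next-token prediction. All sizes are toy
# assumptions for illustration only.
import torch
import torch.nn as nn
import torch.nn.functional as F

# Assumed per-modality codebook sizes; each modality gets its own offset so all
# tokens live in a single shared vocabulary.
VISION_CODES, TEXT_CODES, ACTION_BINS = 1024, 512, 256
VOCAB = VISION_CODES + TEXT_CODES + ACTION_BINS

def to_shared_vocab(vision_ids, text_ids, action_ids):
    """Interleave one step's tokens into a single discrete sequence:
    [vision tokens][language tokens][action tokens]."""
    return torch.cat([
        vision_ids,                                   # e.g. VQ codes of the observation
        text_ids + VISION_CODES,                      # language instruction tokens
        action_ids + VISION_CODES + TEXT_CODES,       # binned continuous actions
    ])

class TinyUnifiedAR(nn.Module):
    """A small causal transformer over the shared token sequence."""
    def __init__(self, d_model=256, n_heads=4, n_layers=2):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.head = nn.Linear(d_model, VOCAB)

    def forward(self, tokens):                        # tokens: (B, L) int64
        L = tokens.size(1)
        causal = nn.Transformer.generate_square_subsequent_mask(L)
        h = self.backbone(self.embed(tokens), mask=causal)
        return self.head(h)                           # (B, L, VOCAB) logits

# Toy training step: predict every next token (vision, language, and action
# alike), so the same objective can double as world modeling on video data.
model = TinyUnifiedAR()
seq = to_shared_vocab(
    torch.randint(0, VISION_CODES, (64,)),            # 64 vision tokens per frame
    torch.randint(0, TEXT_CODES, (16,)),              # 16 instruction tokens
    torch.randint(0, ACTION_BINS, (7,)),               # 7-dim discretized action
).unsqueeze(0)                                        # batch of 1
logits = model(seq[:, :-1])
loss = F.cross_entropy(logits.reshape(-1, VOCAB), seq[:, 1:].reshape(-1))
loss.backward()
```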