Unified Vision-Language-Action Model
June 24, 2025
Authors: Yuqi Wang, Xinghang Li, Wenxuan Wang, Junbo Zhang, Yingyan Li, Yuntao Chen, Xinlong Wang, Zhaoxiang Zhang
cs.AI
Abstract
Vision-language-action models (VLAs) have garnered significant attention for
their potential in advancing robotic manipulation. However, previous approaches
predominantly rely on the general comprehension capabilities of vision-language
models (VLMs) to generate action signals, often overlooking the rich temporal
and causal structure embedded in visual observations. In this paper, we present
UniVLA, a unified and native multimodal VLA model that autoregressively models
vision, language, and action signals as discrete token sequences. This
formulation enables flexible multimodal task learning, particularly from
large-scale video data. By incorporating world modeling during post-training,
UniVLA captures causal dynamics from videos, facilitating effective transfer to
downstream policy learning--especially for long-horizon tasks. Our approach
sets new state-of-the-art results across several widely used simulation
benchmarks, including CALVIN, LIBERO, and SimplerEnv-Bridge, significantly
surpassing previous methods. For example, UniVLA achieves a 95.5% average
success rate on the LIBERO benchmark, surpassing pi0-FAST's 85.5%. We further
demonstrate
its broad applicability on real-world ALOHA manipulation and autonomous
driving.
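
To make the token-sequence formulation described in the abstract concrete, below is a minimal, hypothetical sketch (not the authors' implementation) of autoregressive modeling over vision, language, and action signals discretized into one shared vocabulary. The codebook sizes, model dimensions, and helper names (to_shared_vocab, TinyUnifiedAR) are illustrative assumptions.

```python
# Minimal sketch: vision, language, and action tokens share one vocabulary and
# are modeled autoregressively with next-token prediction. All sizes are toy
# assumptions for illustration only.
import torch
import torch.nn as nn
import torch.nn.functional as F

# Assumed per-modality codebook sizes; each modality gets its own offset so all
# tokens live in a single shared vocabulary.
VISION_CODES, TEXT_CODES, ACTION_BINS = 1024, 512, 256
VOCAB = VISION_CODES + TEXT_CODES + ACTION_BINS

def to_shared_vocab(vision_ids, text_ids, action_ids):
    """Interleave one step's tokens into a single discrete sequence:
    [vision tokens][language tokens][action tokens]."""
    return torch.cat([
        vision_ids,                                   # e.g. VQ codes of the observation
        text_ids + VISION_CODES,                      # language instruction tokens
        action_ids + VISION_CODES + TEXT_CODES,       # binned continuous actions
    ])

class TinyUnifiedAR(nn.Module):
    """A small causal transformer over the shared token sequence."""
    def __init__(self, d_model=256, n_heads=4, n_layers=2):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.head = nn.Linear(d_model, VOCAB)

    def forward(self, tokens):                        # tokens: (B, L) int64
        L = tokens.size(1)
        causal = nn.Transformer.generate_square_subsequent_mask(L)
        h = self.backbone(self.embed(tokens), mask=causal)
        return self.head(h)                           # (B, L, VOCAB) logits

# Toy training step: predict every next token (vision, language, and action
# alike), so the same objective can double as world modeling on video data.
model = TinyUnifiedAR()
seq = to_shared_vocab(
    torch.randint(0, VISION_CODES, (64,)),            # 64 vision tokens per frame
    torch.randint(0, TEXT_CODES, (16,)),              # 16 instruction tokens
    torch.randint(0, ACTION_BINS, (7,)),               # 7-dim discretized action
).unsqueeze(0)                                        # batch of 1
logits = model(seq[:, :-1])
loss = F.cross_entropy(logits.reshape(-1, VOCAB), seq[:, 1:].reshape(-1))
loss.backward()
```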