UniVLA: Learning to Act Anywhere with Task-centric Latent Actions
May 9, 2025
Authors: Qingwen Bu, Yanting Yang, Jisong Cai, Shenyuan Gao, Guanghui Ren, Maoqing Yao, Ping Luo, Hongyang Li
cs.AI
Abstract
A generalist robot should perform effectively across various environments.
However, most existing approaches heavily rely on scaling action-annotated data
to enhance their capabilities. Consequently, they are often limited to a single
physical specification and struggle to learn transferable knowledge across
different embodiments and environments. To confront these limitations, we
propose UniVLA, a new framework for learning cross-embodiment
vision-language-action (VLA) policies. Our key innovation is to derive
task-centric action representations from videos with a latent action model.
This enables us to exploit extensive data across a wide spectrum of embodiments
and perspectives. To mitigate the effect of task-irrelevant dynamics, we
incorporate language instructions and establish a latent action model within
the DINO feature space. Learned from internet-scale videos, the generalist
policy can be deployed to various robots through efficient latent action
decoding. We obtain state-of-the-art results across multiple manipulation and
navigation benchmarks, as well as real-robot deployments. UniVLA achieves
superior performance over OpenVLA with less than 1/20 of pretraining compute
and 1/10 of downstream data. Continuous performance improvements are observed
as heterogeneous data, even including human videos, are incorporated into the
training pipeline. The results underscore UniVLA's potential to facilitate
scalable and efficient robot policy learning.