UniVLA: Learning to Act Anywhere with Task-centric Latent Actions
May 9, 2025
作者: Qingwen Bu, Yanting Yang, Jisong Cai, Shenyuan Gao, Guanghui Ren, Maoqing Yao, Ping Luo, Hongyang Li
cs.AI
Abstract
A generalist robot should perform effectively across various environments.
However, most existing approaches heavily rely on scaling action-annotated data
to enhance their capabilities. Consequently, they are often limited to a single
physical specification and struggle to learn transferable knowledge across
different embodiments and environments. To address these limitations, we
propose UniVLA, a new framework for learning cross-embodiment
vision-language-action (VLA) policies. Our key innovation is to derive
task-centric action representations from videos with a latent action model.
This enables us to exploit extensive data across a wide spectrum of embodiments
and perspectives. To mitigate the effect of task-irrelevant dynamics, we
incorporate language instructions and establish a latent action model within
the DINO feature space. Learned from internet-scale videos, the generalist
policy can be deployed to various robots through efficient latent action
decoding. We obtain state-of-the-art results across multiple manipulation and
navigation benchmarks, as well as real-robot deployments. UniVLA achieves
superior performance over OpenVLA with less than 1/20 of pretraining compute
and 1/10 of downstream data. Continuous performance improvements are observed
as heterogeneous data, even including human videos, are incorporated into the
training pipeline. The results underscore UniVLA's potential to facilitate
scalable and efficient robot policy learning.
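The core idea of deriving discrete latent actions from video can be illustrated with a minimal sketch. This is not UniVLA's actual implementation: the feature dimension, codebook size, and the use of a simple feature-difference with nearest-neighbor quantization are all assumptions standing in for a learned latent action model over DINO features.

```python
import numpy as np

# Hypothetical sketch of a latent action model (not UniVLA's actual code):
# the change between consecutive per-frame visual features (e.g., DINO
# embeddings) is quantized to a discrete "latent action" via a codebook.

rng = np.random.default_rng(0)

FEAT_DIM = 32   # per-frame visual feature dimensionality (assumed)
NUM_CODES = 8   # size of the discrete latent-action codebook (assumed)

codebook = rng.normal(size=(NUM_CODES, FEAT_DIM))  # stand-in for learned codes

def encode_latent_action(feat_t, feat_t1, codebook):
    """Map the transition between two frame features to the nearest code index."""
    delta = feat_t1 - feat_t                       # dynamics as feature change
    dists = np.linalg.norm(codebook - delta, axis=1)
    return int(np.argmin(dists))                   # discrete latent action token

# Example: encode a short clip of 5 frames into 4 latent-action tokens,
# which a generalist policy could later be trained to predict from language
# and observations, then decode into embodiment-specific robot actions.
frames = rng.normal(size=(5, FEAT_DIM))
tokens = [encode_latent_action(frames[i], frames[i + 1], codebook)
          for i in range(len(frames) - 1)]
print(tokens)
```

Because the tokens are embodiment-agnostic indices rather than raw motor commands, the same vocabulary can in principle be extracted from videos of different robots or even humans, which is what allows the pretraining data pool to scale.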