Latent Action Pretraining from Videos
October 15, 2024
Authors: Seonghyeon Ye, Joel Jang, Byeongguk Jeon, Sejune Joo, Jianwei Yang, Baolin Peng, Ajay Mandlekar, Reuben Tan, Yu-Wei Chao, Bill Yuchen Lin, Lars Liden, Kimin Lee, Jianfeng Gao, Luke Zettlemoyer, Dieter Fox, Minjoon Seo
cs.AI
Abstract
We introduce Latent Action Pretraining for general Action models (LAPA), an
unsupervised method for pretraining Vision-Language-Action (VLA) models without
ground-truth robot action labels. Existing Vision-Language-Action models
require action labels typically collected by human teleoperators during
pretraining, which significantly limits possible data sources and scale. In
this work, we propose a method to learn from internet-scale videos that do not
have robot action labels. We first train an action quantization model
leveraging a VQ-VAE-based objective to learn discrete latent actions between
image frames, then pretrain a latent VLA model to predict these latent actions
from observations and task descriptions, and finally finetune the VLA on
small-scale robot manipulation data to map from latent to robot actions.
Experimental results demonstrate that our method significantly outperforms
existing techniques that train robot manipulation policies from large-scale
videos. Furthermore, it outperforms the state-of-the-art VLA model trained with
robotic action labels on real-world manipulation tasks that require language
conditioning, generalization to unseen objects, and semantic generalization to
unseen instructions. Training only on human manipulation videos also shows
positive transfer, opening up the potential for leveraging web-scale data for
robotics foundation models.
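To make the first pretraining stage concrete, the sketch below shows how a VQ-VAE-style objective can quantize the change between two consecutive frames into a small vocabulary of discrete latent actions. This is not the authors' implementation: the module name `LatentActionQuantizer`, the feature dimensions, and the codebook size are illustrative assumptions, and a real pipeline would encode raw image frames rather than the pre-extracted feature vectors assumed here.

```python
# Minimal sketch of a VQ-VAE-style latent action quantizer (assumed names/sizes,
# not the LAPA codebase). It maps a (frame_t, frame_t+1) pair to a discrete code
# and is trained to reconstruct frame_t+1 from frame_t plus that code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class LatentActionQuantizer(nn.Module):
    def __init__(self, frame_dim=512, latent_dim=32, codebook_size=8):
        super().__init__()
        # Encoder: infers a continuous action latent from the frame pair.
        self.encoder = nn.Sequential(
            nn.Linear(2 * frame_dim, 256), nn.ReLU(), nn.Linear(256, latent_dim)
        )
        # Small discrete codebook of latent actions.
        self.codebook = nn.Embedding(codebook_size, latent_dim)
        # Decoder: predicts next-frame features from the current frame and the
        # quantized latent action.
        self.decoder = nn.Sequential(
            nn.Linear(frame_dim + latent_dim, 256), nn.ReLU(), nn.Linear(256, frame_dim)
        )

    def forward(self, frame_t, frame_tp1):
        z = self.encoder(torch.cat([frame_t, frame_tp1], dim=-1))
        # Nearest-codebook-entry lookup gives the discrete latent action id.
        dists = torch.cdist(z, self.codebook.weight)   # (B, codebook_size)
        idx = dists.argmin(dim=-1)                     # (B,)
        z_q = self.codebook(idx)
        # Straight-through estimator so gradients reach the encoder.
        z_st = z + (z_q - z).detach()
        recon = self.decoder(torch.cat([frame_t, z_st], dim=-1))
        # VQ-VAE losses: reconstruction + codebook + commitment terms.
        loss = (
            F.mse_loss(recon, frame_tp1)
            + F.mse_loss(z_q, z.detach())
            + 0.25 * F.mse_loss(z, z_q.detach())
        )
        return idx, loss


if __name__ == "__main__":
    quantizer = LatentActionQuantizer()
    f_t, f_tp1 = torch.randn(4, 512), torch.randn(4, 512)
    actions, loss = quantizer(f_t, f_tp1)
    print(actions.shape, loss.item())  # discrete latent-action ids and training loss
```

In the paper's pipeline, the discrete ids produced by such a quantizer would serve as pseudo-action labels: the latent VLA model is pretrained to predict them from observations and task descriptions, and a later finetuning stage on a small amount of labeled robot data maps them to executable robot actions.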