Latent Action Pretraining from Videos
October 15, 2024
Authors: Seonghyeon Ye, Joel Jang, Byeongguk Jeon, Sejune Joo, Jianwei Yang, Baolin Peng, Ajay Mandlekar, Reuben Tan, Yu-Wei Chao, Bill Yuchen Lin, Lars Liden, Kimin Lee, Jianfeng Gao, Luke Zettlemoyer, Dieter Fox, Minjoon Seo
cs.AI
Abstract
We introduce Latent Action Pretraining for general Action models (LAPA), an
unsupervised method for pretraining Vision-Language-Action (VLA) models without
ground-truth robot action labels. Existing Vision-Language-Action models
require action labels typically collected by human teleoperators during
pretraining, which significantly limits possible data sources and scale. In
this work, we propose a method to learn from internet-scale videos that do not
have robot action labels. We first train an action quantization model
leveraging a VQ-VAE-based objective to learn discrete latent actions between
image frames, then pretrain a latent VLA model to predict these latent actions
from observations and task descriptions, and finally finetune the VLA on
small-scale robot manipulation data to map from latent to robot actions.
Experimental results demonstrate that our method significantly outperforms
existing techniques that train robot manipulation policies from large-scale
videos. Furthermore, it outperforms the state-of-the-art VLA model trained with
robotic action labels on real-world manipulation tasks that require language
conditioning, generalization to unseen objects, and semantic generalization to
unseen instructions. Training only on human manipulation videos also shows
positive transfer, opening up the potential for leveraging web-scale data for
robotics foundation models.
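To make the first pretraining stage concrete, the sketch below shows how a VQ-VAE-style objective can quantize the change between two consecutive frames into a small vocabulary of discrete latent actions. This is not the authors' implementation: the module name `LatentActionQuantizer`, the feature dimensions, and the codebook size are illustrative assumptions, and a real pipeline would encode raw image frames rather than the pre-extracted feature vectors assumed here.

```python
# Minimal sketch of a VQ-VAE-style latent action quantizer (assumed names/sizes,
# not the LAPA codebase). It maps a (frame_t, frame_t+1) pair to a discrete code
# and is trained to reconstruct frame_t+1 from frame_t plus that code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class LatentActionQuantizer(nn.Module):
    def __init__(self, frame_dim=512, latent_dim=32, codebook_size=8):
        super().__init__()
        # Encoder: infers a continuous action latent from the frame pair.
        self.encoder = nn.Sequential(
            nn.Linear(2 * frame_dim, 256), nn.ReLU(), nn.Linear(256, latent_dim)
        )
        # Small discrete codebook of latent actions.
        self.codebook = nn.Embedding(codebook_size, latent_dim)
        # Decoder: predicts next-frame features from the current frame and the
        # quantized latent action.
        self.decoder = nn.Sequential(
            nn.Linear(frame_dim + latent_dim, 256), nn.ReLU(), nn.Linear(256, frame_dim)
        )

    def forward(self, frame_t, frame_tp1):
        z = self.encoder(torch.cat([frame_t, frame_tp1], dim=-1))
        # Nearest-codebook-entry lookup gives the discrete latent action id.
        dists = torch.cdist(z, self.codebook.weight)   # (B, codebook_size)
        idx = dists.argmin(dim=-1)                     # (B,)
        z_q = self.codebook(idx)
        # Straight-through estimator so gradients reach the encoder.
        z_st = z + (z_q - z).detach()
        recon = self.decoder(torch.cat([frame_t, z_st], dim=-1))
        # VQ-VAE losses: reconstruction + codebook + commitment terms.
        loss = (
            F.mse_loss(recon, frame_tp1)
            + F.mse_loss(z_q, z.detach())
            + 0.25 * F.mse_loss(z, z_q.detach())
        )
        return idx, loss


if __name__ == "__main__":
    quantizer = LatentActionQuantizer()
    f_t, f_tp1 = torch.randn(4, 512), torch.randn(4, 512)
    actions, loss = quantizer(f_t, f_tp1)
    print(actions.shape, loss.item())  # discrete latent-action ids and training loss
```

In the paper's pipeline, the discrete ids produced by such a quantizer would serve as pseudo-action labels: the latent VLA model is pretrained to predict them from observations and task descriptions, and a later finetuning stage on a small amount of labeled robot data maps them to executable robot actions.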