

Latent Action Pretraining from Videos

October 15, 2024
Authors: Seonghyeon Ye, Joel Jang, Byeongguk Jeon, Sejune Joo, Jianwei Yang, Baolin Peng, Ajay Mandlekar, Reuben Tan, Yu-Wei Chao, Bill Yuchen Lin, Lars Liden, Kimin Lee, Jianfeng Gao, Luke Zettlemoyer, Dieter Fox, Minjoon Seo
cs.AI

Abstract

We introduce Latent Action Pretraining for general Action models (LAPA), an unsupervised method for pretraining Vision-Language-Action (VLA) models without ground-truth robot action labels. Existing Vision-Language-Action models require action labels typically collected by human teleoperators during pretraining, which significantly limits possible data sources and scale. In this work, we propose a method to learn from internet-scale videos that do not have robot action labels. We first train an action quantization model leveraging a VQ-VAE-based objective to learn discrete latent actions between image frames, then pretrain a latent VLA model to predict these latent actions from observations and task descriptions, and finally finetune the VLA on small-scale robot manipulation data to map from latent actions to robot actions. Experimental results demonstrate that our method significantly outperforms existing techniques that train robot manipulation policies from large-scale videos. Furthermore, it outperforms the state-of-the-art VLA model trained with robotic action labels on real-world manipulation tasks that require language conditioning, generalization to unseen objects, and semantic generalization to unseen instructions. Training only on human manipulation videos also shows positive transfer, opening up the potential for leveraging web-scale data for robotics foundation models.
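
As a rough illustration of the three-stage recipe described in the abstract (latent-action quantization, latent-action pretraining, and action finetuning), below is a minimal PyTorch sketch. All class names (`LatentActionQuantizer`, `LatentVLA`, `ActionDecoderHead`), feature dimensions, codebook size, and the 7-dimensional action output are illustrative assumptions, not details taken from the paper.

```python
# Minimal sketch (PyTorch) of the three-stage recipe in the abstract.
# All names, dimensions, and quantizer details are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class LatentActionQuantizer(nn.Module):
    """Stage 1: learn discrete latent actions between consecutive frame
    features with a VQ-VAE-style objective (encoder -> codebook -> decoder)."""

    def __init__(self, frame_dim=512, code_dim=64, codebook_size=256):
        super().__init__()
        self.encoder = nn.Sequential(  # encodes a (frame_t, frame_t+1) pair
            nn.Linear(2 * frame_dim, 256), nn.ReLU(), nn.Linear(256, code_dim))
        self.codebook = nn.Embedding(codebook_size, code_dim)
        self.decoder = nn.Sequential(  # reconstructs next-frame features from frame_t + latent action
            nn.Linear(frame_dim + code_dim, 256), nn.ReLU(), nn.Linear(256, frame_dim))

    def forward(self, feat_t, feat_tp1):
        z_e = self.encoder(torch.cat([feat_t, feat_tp1], dim=-1))
        dists = torch.cdist(z_e, self.codebook.weight)       # (B, codebook_size)
        idx = dists.argmin(dim=-1)                           # discrete latent action index
        z_q = self.codebook(idx)
        z_q_st = z_e + (z_q - z_e).detach()                  # straight-through estimator
        recon = self.decoder(torch.cat([feat_t, z_q_st], dim=-1))
        recon_loss = F.mse_loss(recon, feat_tp1)
        vq_loss = F.mse_loss(z_q, z_e.detach()) + 0.25 * F.mse_loss(z_e, z_q.detach())
        return idx, recon_loss + vq_loss


class LatentVLA(nn.Module):
    """Stage 2: predict the discrete latent action from observation and
    task-description features (cross-entropy over codebook indices)."""

    def __init__(self, obs_dim=512, text_dim=512, codebook_size=256):
        super().__init__()
        self.policy = nn.Sequential(
            nn.Linear(obs_dim + text_dim, 512), nn.ReLU(), nn.Linear(512, codebook_size))

    def forward(self, obs_feat, text_feat):
        return self.policy(torch.cat([obs_feat, text_feat], dim=-1))  # logits over latent actions


class ActionDecoderHead(nn.Module):
    """Stage 3: map discrete latent actions to continuous robot actions,
    finetuned on a small labeled manipulation dataset."""

    def __init__(self, codebook, code_dim=64, action_dim=7):
        super().__init__()
        self.codebook = codebook                             # codebook learned in stage 1
        self.head = nn.Sequential(
            nn.Linear(code_dim, 128), nn.ReLU(), nn.Linear(128, action_dim))

    def forward(self, latent_idx):
        return self.head(self.codebook(latent_idx))


# Toy end-to-end pass with random features standing in for real video/text encoders.
quantizer, vla = LatentActionQuantizer(), LatentVLA()
decoder = ActionDecoderHead(codebook=quantizer.codebook)
feat_t, feat_tp1, text = torch.randn(8, 512), torch.randn(8, 512), torch.randn(8, 512)

idx, stage1_loss = quantizer(feat_t, feat_tp1)               # stage 1: latent-action labels + VQ loss
logits = vla(feat_t, text)                                   # stage 2: predict latent actions
stage2_loss = F.cross_entropy(logits, idx)                   # latent actions as pretraining targets
robot_actions = decoder(logits.argmax(dim=-1))               # stage 3: (8, 7) continuous actions
```

In the actual system the observation and language features would come from a pretrained vision-language backbone rather than random tensors; the linear stand-ins above are only to keep the sketch self-contained.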
