Latent Action Pretraining from Videos

Abstract

We introduce Latent Action Pretraining for general Action models (LAPA), an unsupervised method for pretraining Vision-Language-Action (VLA) models without ground-truth robot action labels. Existing VLA models require action labels during pretraining, typically collected by human teleoperators, which significantly limits the possible data sources and scale. In this work, we propose a method to learn from internet-scale videos that do not have robot action labels. We first train an action quantization model that uses a VQ-VAE-based objective to learn discrete latent actions between image frames, then pretrain a latent VLA model to predict these latent actions from observations and task descriptions, and finally finetune the VLA on small-scale robot manipulation data to map from latent actions to robot actions. Experimental results demonstrate that our method significantly outperforms existing techniques that train robot manipulation policies from large-scale videos. Furthermore, it outperforms the state-of-the-art VLA model trained with robotic action labels on real-world manipulation tasks that require language conditioning, generalization to unseen objects, and semantic generalization to unseen instructions. Training only on human manipulation videos also shows positive transfer, opening up the potential for leveraging web-scale data for robotics foundation models.
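As a rough illustration of the first stage only, the quantization step at the core of a VQ-VAE-style model can be sketched as a nearest-neighbor lookup into a codebook. This is a toy sketch, not the authors' implementation: the names `quantize` and `latent_actions`, the hand-written 2-D features, the three-entry codebook, and the use of a plain feature difference in place of a learned encoder are all assumptions made for illustration. In an actual VQ-VAE the encoder and codebook are learned jointly with a decoder, which is omitted here.

```python
from typing import List, Sequence


def quantize(delta: Sequence[float], codebook: Sequence[Sequence[float]]) -> int:
    """Return the index of the codebook vector closest to `delta` (squared L2)."""
    def sq_dist(a: Sequence[float], b: Sequence[float]) -> float:
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return min(range(len(codebook)), key=lambda i: sq_dist(delta, codebook[i]))


def latent_actions(
    frames: Sequence[Sequence[float]], codebook: Sequence[Sequence[float]]
) -> List[int]:
    """Assign one discrete latent action to each pair of consecutive frames.

    Assumption: the change between frames is approximated by a simple
    feature difference, standing in for a learned encoder.
    """
    actions = []
    for prev, nxt in zip(frames, frames[1:]):
        delta = [b - a for a, b in zip(prev, nxt)]
        actions.append(quantize(delta, codebook))
    return actions


# Hypothetical codebook: 0 = "no change", 1 = "move along x", 2 = "move along y".
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
# Toy 2-D "frame features" (hand-written, not real video data).
frames = [[0.0, 0.0], [0.9, 0.1], [1.0, 1.1], [1.0, 1.1]]

print(latent_actions(frames, codebook))  # [1, 2, 0]
```

The resulting indices play the role of the discrete latent actions that the second stage would train the latent VLA model to predict from observations and task descriptions.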

