Unified World Models: Coupling Video and Action Diffusion for Pretraining on Large Robotic Datasets

Abstract

Imitation learning has emerged as a promising approach toward building generalist robots. However, scaling imitation learning to large robot foundation models remains challenging because it relies on high-quality expert demonstrations. Meanwhile, large amounts of video data depicting a wide range of environments and diverse behaviors are readily available. This data provides a rich source of information about real-world dynamics and agent-environment interactions. Leveraging it directly for imitation learning, however, has proven difficult because it lacks the action annotations that most contemporary methods require. In this work, we present Unified World Models (UWM), a framework that leverages both video and action data for policy learning. Specifically, UWM integrates an action diffusion process and a video diffusion process within a unified transformer architecture, where an independent diffusion timestep governs each modality. We show that by simply controlling each diffusion timestep, UWM can flexibly represent a policy, a forward dynamics model, an inverse dynamics model, and a video generator. Through simulated and real-world experiments, we show that: (1) UWM enables effective pretraining on large-scale multitask robot datasets with both dynamics and action predictions, resulting in more generalizable and robust policies than imitation learning, and (2) UWM naturally facilitates learning from action-free video data through independent control of modality-specific diffusion timesteps, further improving the performance of finetuned policies. Our results suggest that UWM offers a promising step toward harnessing large, heterogeneous datasets for scalable robot learning, and provides a simple unification of the often disparate paradigms of imitation learning and world modeling. Videos and code are available at https://weirdlabuw.github.io/uwm/.
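The timestep control described in the abstract can be illustrated with a small sketch. It is not taken from the released code. It assumes a common diffusion convention in which timestep 0 means a clean (observed) input and the maximum timestep T means pure noise, so that a modality held at 0 is conditioned on, a modality held at T is effectively marginalized out, and a modality that is denoised is the one being generated. The names `T`, `DENOISE`, `MODES`, `describe`, and `sample_training_timesteps`, and the value `T = 1000`, are hypothetical and chosen only for illustration.

```python
import random
from typing import NamedTuple, Union

T = 1000             # hypothetical number of diffusion steps (assumption)
DENOISE = "denoise"  # marker: this modality is sampled by denoising from T to 0


class Timesteps(NamedTuple):
    action: Union[int, str]  # timestep setting for the action modality
    video: Union[int, str]   # timestep setting for the future-video modality


# Assumed mapping from inference mode to (action, video) timestep settings.
MODES = {
    "policy": Timesteps(action=DENOISE, video=T),            # generate actions, ignore video
    "forward_dynamics": Timesteps(action=0, video=DENOISE),  # clean actions in, video out
    "inverse_dynamics": Timesteps(action=DENOISE, video=0),  # clean video in, actions out
    "video_generator": Timesteps(action=T, video=DENOISE),   # generate video, ignore actions
}


def role(t: Union[int, str]) -> str:
    """Name the role a modality plays under the assumed convention."""
    if t == DENOISE:
        return "generated"
    if t == 0:
        return "conditioned"
    if t == T:
        return "marginalized"
    raise ValueError(f"unexpected timestep setting: {t!r}")


def describe(mode: str) -> str:
    ts = MODES[mode]
    return f"{mode}: action {role(ts.action)}, video {role(ts.video)}"


def sample_training_timesteps(has_actions: bool, rng: random.Random) -> tuple:
    """Draw independent per-modality timesteps for one training example.

    For action-free video, the action timestep is pinned to T, which under
    this convention treats the missing actions as pure noise.
    """
    t_video = rng.randint(0, T)
    t_action = rng.randint(0, T) if has_actions else T
    return t_action, t_video


if __name__ == "__main__":
    for name in MODES:
        print(describe(name))
    print(sample_training_timesteps(has_actions=False, rng=random.Random(0)))
```

Under these assumptions, switching between the four behaviors changes only the pair of timestep settings, not the network. Action-free video fits the same training loop by fixing the action timestep at T.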

