τ₀-WM: A Unified Video-Action World Model for Robotic Manipulation

Abstract

Robotic manipulation requires models that generate executable actions while anticipating and evaluating the consequences of those actions before physical execution. We present τ₀-World Model (τ₀-WM), a unified video-action world model that integrates policy learning, video prediction, and action evaluation within a single future-predictive framework. Built on a shared video diffusion backbone, τ₀-WM provides two complementary interfaces. First, a video action model jointly predicts future visual latents and continuous action chunks from multi-view observations, language instructions, and robot state. Second, an action-conditioned video simulator rolls out candidate action chunks into multi-view future frames and predicts dense task-progress scores. The model is trained on approximately 27,300 hours of data spanning real-robot teleoperation, UMI-style interaction, egocentric human videos, and rollout and failure trajectories, using modality-specific supervision masks. At inference time, τ₀-WM uses test-time computation to sample action candidates, rank them by re-denoising consistency, and apply simulator-based rectification to low-quality candidates. On challenging long-horizon and fine-grained robotic manipulation tasks, τ₀-WM outperforms the relevant baselines.
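The inference-time procedure described above (sample candidates, rank them, rectify weak ones) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration and not the authors' implementation: the names `select_action_chunk`, `toy_denoise`, `toy_consistency`, and `toy_rectify` are hypothetical, the action chunk is a plain list of floats, and "re-denoising consistency" is approximated as the negative mean squared error between a candidate and its re-noised, re-denoised copy. The abstract does not specify the actual scoring rule, threshold, or rectification step.

```python
import random
from typing import Callable, List, Tuple

ActionChunk = List[float]  # toy stand-in for a continuous action chunk

TARGET: ActionChunk = [0.0, 0.0, 0.0, 0.0]  # hypothetical "good" chunk


def toy_denoise(noisy: ActionChunk) -> ActionChunk:
    """Hypothetical denoiser: pulls each value halfway toward TARGET."""
    return [0.5 * (x + t) for x, t in zip(noisy, TARGET)]


def toy_consistency(chunk: ActionChunk, noise_scale: float = 0.1) -> float:
    """Assumed consistency score: re-noise the chunk, denoise it again, and
    return the negative mean squared error to the original (higher is better).
    The fixed seed makes the score deterministic for a given chunk."""
    rng = random.Random(0)
    noisy = [x + rng.gauss(0.0, noise_scale) for x in chunk]
    redenoised = toy_denoise(noisy)
    err = sum((a - b) ** 2 for a, b in zip(chunk, redenoised)) / len(chunk)
    return -err


def toy_rectify(chunk: ActionChunk) -> ActionChunk:
    """Hypothetical rectifier standing in for a simulator-guided correction."""
    return toy_denoise(chunk)


def select_action_chunk(
    sample_fn: Callable[[], ActionChunk],
    consistency_fn: Callable[[ActionChunk], float],
    rectify_fn: Callable[[ActionChunk], ActionChunk],
    num_candidates: int = 8,
    threshold: float = -0.05,
) -> Tuple[ActionChunk, float, bool]:
    """Sample candidates, keep the best-scoring one, and rectify it only if
    its score is below the threshold. Returns (chunk, score, was_rectified)."""
    candidates = [sample_fn() for _ in range(num_candidates)]
    scores = [consistency_fn(c) for c in candidates]
    best_idx = max(range(num_candidates), key=lambda i: scores[i])
    best, best_score = candidates[best_idx], scores[best_idx]
    if best_score >= threshold:
        return best, best_score, False
    fixed = rectify_fn(best)
    return fixed, consistency_fn(fixed), True
```

In this sketch the extra test-time computation is controlled by `num_candidates`, and `threshold` decides when the more expensive rectification path is taken. Both are illustrative knobs, not values reported in the paper.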