τ_0-WM: A Unified Video-Action World Model for Robotic Manipulation

Abstract

Robotic manipulation requires models that generate executable actions and that anticipate and evaluate the consequences of those actions before physical execution. We present τ_0-World Model (τ_0-WM), a unified video-action world model that integrates policy learning, video prediction, and action evaluation within a single future-predictive framework. Built on a shared video diffusion backbone, τ_0-WM provides two complementary interfaces. First, a video action model jointly predicts future visual latents and continuous action chunks from multi-view observations, language instructions, and robot state. Second, an action-conditioned video simulator rolls out candidate action chunks into multi-view futures and predicts dense task-progress scores. The model is trained with modality-specific supervision masks on approximately 27,300 hours of data spanning real-robot teleoperation, UMI-style interaction, egocentric human videos, and rollout or failure trajectories. At inference time, τ_0-WM spends test-time computation to sample action candidates, rank them by re-denoising consistency, and invoke simulator-based rectification for low-quality candidates. On challenging long-horizon and fine-grained robotic manipulation tasks, τ_0-WM outperforms relevant baselines.
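The inference-time procedure described above (sample candidates, rank them, rectify weak ones) can be illustrated with a minimal sketch. The abstract does not define re-denoising consistency, the quality threshold, or the rectification step, so everything here is an assumption made for illustration: the score is taken to be the negative mean squared distance between a candidate and its re-noised, re-denoised copy, and `redenoise`, `rectify`, `threshold`, and `noise_scale` are hypothetical placeholders rather than parts of τ_0-WM.

```python
import random
from typing import Callable, List, Sequence, Tuple

# Hypothetical representation: one action chunk flattened into a list of floats.
Action = List[float]


def consistency_score(
    action: Action,
    redenoise: Callable[[Action], Action],
    noise_scale: float,
    rng: random.Random,
) -> float:
    """Assumed score: perturb the candidate with Gaussian noise, denoise it
    again, and return the negative mean squared distance to the original.
    Higher (closer to zero) means the candidate is more self-consistent."""
    noised = [a + rng.gauss(0.0, noise_scale) for a in action]
    restored = redenoise(noised)
    mse = sum((a - r) ** 2 for a, r in zip(action, restored)) / len(action)
    return -mse


def select_action(
    candidates: Sequence[Action],
    redenoise: Callable[[Action], Action],
    rectify: Callable[[Action], Action],
    threshold: float,
    noise_scale: float = 0.1,
    seed: int = 0,
) -> Tuple[Action, bool]:
    """Rank candidates by the assumed consistency score and return the best.
    If even the best score falls below `threshold`, pass that candidate
    through `rectify` (a stand-in for simulator-based rectification).
    The boolean reports whether rectification was invoked."""
    rng = random.Random(seed)
    scored = [
        (consistency_score(c, redenoise, noise_scale, rng), c) for c in candidates
    ]
    # Compare on the score only, so ties never fall back to comparing lists.
    best_score, best = max(scored, key=lambda pair: pair[0])
    if best_score < threshold:
        return rectify(best), True
    return best, False
```

In this sketch, `redenoise` would wrap a denoising pass of the video action model and `rectify` would wrap a refinement loop driven by the action-conditioned simulator and its task-progress scores. Whether rectification applies to the top-ranked candidate, as assumed here, or to every low-scoring candidate is not specified in the abstract.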