VideoWorld 2: Learning Transferable Knowledge from Real-world Videos
February 10, 2026
Authors: Zhongwei Ren, Yunchao Wei, Xiao Yu, Guixun Luo, Yao Zhao, Bingyi Kang, Jiashi Feng, Xiaojie Jin
cs.AI
Abstract
Learning transferable knowledge from unlabeled video data and applying it in new environments is a fundamental capability of intelligent agents. This work presents VideoWorld 2, which extends VideoWorld and offers the first investigation into learning transferable knowledge directly from raw real-world videos. At its core, VideoWorld 2 introduces a dynamic-enhanced Latent Dynamics Model (dLDM) that decouples action dynamics from visual appearance: a pretrained video diffusion model handles visual appearance modeling, freeing the dLDM to learn compact latent codes that capture meaningful task-related dynamics. These latent codes are then modeled autoregressively to learn task policies and support long-horizon reasoning. We evaluate VideoWorld 2 on challenging real-world handcraft-making tasks, where prior video-generation and latent-dynamics models struggle to operate reliably. Remarkably, VideoWorld 2 achieves up to a 70% improvement in task success rate and produces coherent, long-horizon execution videos. In robotics, we show that VideoWorld 2 can acquire effective manipulation knowledge from the Open-X dataset, substantially improving task performance on the CALVIN benchmark. This study reveals the potential of learning transferable world knowledge directly from raw videos; all code, data, and models will be open-sourced to support further research.
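The dLDM design described in the abstract can be sketched in code. The following is a minimal illustration, not the authors' implementation: the module names, layer sizes, the 64x64 frame resolution, and the stubbed diffusion decoder are all assumptions made for readability. It shows the three pieces the abstract names: an encoder that compresses inter-frame dynamics into a compact latent code, a frozen appearance model (standing in for the pretrained video diffusion model) conditioned on that code, and a causal transformer that models the latent sequence autoregressively as a task policy.

```python
import torch
import torch.nn as nn

class DynamicsEncoder(nn.Module):
    """Compress the change between two consecutive frames into a compact latent code."""
    def __init__(self, latent_dim=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(6, 32, 4, stride=2, padding=1), nn.ReLU(),  # two stacked RGB frames
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, latent_dim),
        )

    def forward(self, frame_t, frame_t1):
        # Stacking both frames lets the encoder see motion rather than appearance alone.
        return self.net(torch.cat([frame_t, frame_t1], dim=1))

class FrozenAppearanceDecoder(nn.Module):
    """Stand-in for the pretrained video diffusion model: renders the next frame's
    appearance conditioned on the dynamics latent. A real decoder would run iterative
    denoising; this stub only projects the latent into a pixel-space residual."""
    def __init__(self, latent_dim=32):
        super().__init__()
        self.cond_proj = nn.Linear(latent_dim, 3 * 64 * 64)
        for p in self.parameters():
            p.requires_grad = False  # appearance modeling stays frozen

    def forward(self, frame_t, z):
        return frame_t + self.cond_proj(z).view(-1, 3, 64, 64)

class LatentPolicy(nn.Module):
    """Causal transformer over the sequence of dynamics latents (the task policy)."""
    def __init__(self, latent_dim=32, n_layers=2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=latent_dim, nhead=4, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.head = nn.Linear(latent_dim, latent_dim)

    def forward(self, z_seq):
        T = z_seq.size(1)
        causal_mask = torch.triu(torch.full((T, T), float("-inf")), diagonal=1)
        h = self.transformer(z_seq, mask=causal_mask)
        return self.head(h)  # predict the next dynamics latent at each step

# Toy forward pass over a dummy 8-frame clip.
frames = torch.randn(1, 8, 3, 64, 64)
encoder, decoder, policy = DynamicsEncoder(), FrozenAppearanceDecoder(), LatentPolicy()
z = torch.stack([encoder(frames[:, t], frames[:, t + 1]) for t in range(7)], dim=1)
z_pred = policy(z)                                   # autoregressive latent prediction
next_frame = decoder(frames[:, -1], z_pred[:, -1])   # render appearance from the latent
```

Keeping the appearance model frozen is what makes the decoupling work in this sketch: since the decoder already accounts for visual detail, the latent codes are pushed to carry only the compact, task-relevant dynamics that the policy then models autoregressively.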