VideoWorld 2: Learning Transferable Knowledge from Real-world Videos
February 10, 2026
Authors: Zhongwei Ren, Yunchao Wei, Xiao Yu, Guixun Luo, Yao Zhao, Bingyi Kang, Jiashi Feng, Xiaojie Jin
cs.AI
Abstract
Learning transferable knowledge from unlabeled video data and applying it in new environments is a fundamental capability of intelligent agents. This work presents VideoWorld 2, which extends VideoWorld and offers the first investigation into learning transferable knowledge directly from raw real-world videos. At its core, VideoWorld 2 introduces a dynamic-enhanced Latent Dynamics Model (dLDM) that decouples action dynamics from visual appearance: a pretrained video diffusion model handles visual appearance modeling, freeing the dLDM to learn compact latent codes that capture task-relevant dynamics. These latent codes are then modeled autoregressively to learn task policies and support long-horizon reasoning. We evaluate VideoWorld 2 on challenging real-world handcraft-making tasks, where prior video-generation and latent-dynamics models struggle to operate reliably. Remarkably, VideoWorld 2 achieves up to a 70% improvement in task success rate and produces coherent long-horizon execution videos. In robotics, we show that VideoWorld 2 can acquire effective manipulation knowledge from the Open-X dataset, substantially improving task performance on the CALVIN benchmark. This study reveals the potential of learning transferable world knowledge directly from raw videos; all code, data, and models will be open-sourced to support further research.
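The abstract describes the architecture only at a high level. As a rough illustration of the decoupling idea, the minimal PyTorch sketch below encodes pairs of consecutive frames into compact latent dynamics codes and models the code sequence autoregressively; the pretrained video diffusion decoder that renders appearance is omitted. All module names, shapes, and interfaces here (DynamicsEncoder, LatentPolicy, code_dim, etc.) are hypothetical assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class DynamicsEncoder(nn.Module):
    """Hypothetical sketch: compress a pair of consecutive frames into a
    compact latent code. Stacking the frames channel-wise biases the code
    toward what *changed* (dynamics) rather than static appearance."""
    def __init__(self, code_dim: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(6, 64, 4, stride=2, padding=1),   # two RGB frames stacked
            nn.ReLU(),
            nn.Conv2d(64, 128, 4, stride=2, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
            nn.Linear(128, code_dim),
        )

    def forward(self, frame_t: torch.Tensor, frame_next: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([frame_t, frame_next], dim=1))

class LatentPolicy(nn.Module):
    """Autoregressive model over the sequence of dynamics codes:
    given codes z_1..z_t, predict z_{t+1} (the next step's dynamics)."""
    def __init__(self, code_dim: int = 64, n_layers: int = 4, n_heads: int = 4):
        super().__init__()
        layer = nn.TransformerEncoderLayer(
            d_model=code_dim, nhead=n_heads, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.head = nn.Linear(code_dim, code_dim)

    def forward(self, codes: torch.Tensor) -> torch.Tensor:
        # Causal mask so each step attends only to past codes.
        T = codes.size(1)
        mask = nn.Transformer.generate_square_subsequent_mask(T).to(codes.device)
        h = self.transformer(codes, mask=mask)
        return self.head(h[:, -1])  # predicted next dynamics code

# Toy usage: 8 frames of a 64x64 video -> 7 dynamics codes -> next-code prediction.
frames = torch.randn(1, 8, 3, 64, 64)
enc, policy = DynamicsEncoder(), LatentPolicy()
codes = torch.stack(
    [enc(frames[:, t], frames[:, t + 1]) for t in range(7)], dim=1)
next_code = policy(codes)  # in the full system this would condition the
                           # frozen diffusion model that renders appearance
```

The design point this sketch tries to capture is the division of labor stated in the abstract: appearance generation is delegated to a pretrained diffusion model, so the latent codes only need to carry the low-dimensional, task-relevant dynamics that the autoregressive policy reasons over.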