VideoWorld 2: Learning Transferable Knowledge from Real-world Videos
February 10, 2026
Authors: Zhongwei Ren, Yunchao Wei, Xiao Yu, Guixun Luo, Yao Zhao, Bingyi Kang, Jiashi Feng, Xiaojie Jin
cs.AI
Abstract
Learning transferable knowledge from unlabeled video data and applying it in new environments is a fundamental capability of intelligent agents. This work presents VideoWorld 2, which extends VideoWorld and offers the first investigation into learning transferable knowledge directly from raw real-world videos. At its core, VideoWorld 2 introduces a dynamic-enhanced Latent Dynamics Model (dLDM) that decouples action dynamics from visual appearance: a pretrained video diffusion model handles visual appearance modeling, freeing the dLDM to learn compact latent codes that capture meaningful task-related dynamics. These latent codes are then modeled autoregressively to learn task policies and support long-horizon reasoning. We evaluate VideoWorld 2 on challenging real-world handcraft-making tasks, where prior video-generation and latent-dynamics models struggle to operate reliably. Remarkably, VideoWorld 2 achieves up to a 70% improvement in task success rate and produces coherent, long-horizon execution videos. In robotics, we show that VideoWorld 2 can acquire effective manipulation knowledge from the Open-X dataset, substantially improving task performance on the CALVIN benchmark. This study reveals the potential of learning transferable world knowledge directly from raw videos; all code, data, and models will be open-sourced to support further research.
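The dLDM design described in the abstract can be sketched in code. The following is a minimal illustration, not the authors' implementation: the module names, layer sizes, the 64x64 frame resolution, and the stubbed diffusion decoder are all assumptions made for readability. It shows the three pieces the abstract names: an encoder that compresses inter-frame dynamics into a compact latent code, a frozen appearance model (standing in for the pretrained video diffusion model) conditioned on that code, and a causal transformer that models the latent sequence autoregressively as a task policy.

```python
import torch
import torch.nn as nn

class DynamicsEncoder(nn.Module):
    """Compress the change between two consecutive frames into a compact latent code."""
    def __init__(self, latent_dim=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(6, 32, 4, stride=2, padding=1), nn.ReLU(),  # two stacked RGB frames
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, latent_dim),
        )

    def forward(self, frame_t, frame_t1):
        # Stacking both frames lets the encoder see motion rather than appearance alone.
        return self.net(torch.cat([frame_t, frame_t1], dim=1))

class FrozenAppearanceDecoder(nn.Module):
    """Stand-in for the pretrained video diffusion model: renders the next frame's
    appearance conditioned on the dynamics latent. A real decoder would run iterative
    denoising; this stub only projects the latent into a pixel-space residual."""
    def __init__(self, latent_dim=32):
        super().__init__()
        self.cond_proj = nn.Linear(latent_dim, 3 * 64 * 64)
        for p in self.parameters():
            p.requires_grad = False  # appearance modeling stays frozen

    def forward(self, frame_t, z):
        return frame_t + self.cond_proj(z).view(-1, 3, 64, 64)

class LatentPolicy(nn.Module):
    """Causal transformer over the sequence of dynamics latents (the task policy)."""
    def __init__(self, latent_dim=32, n_layers=2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=latent_dim, nhead=4, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.head = nn.Linear(latent_dim, latent_dim)

    def forward(self, z_seq):
        T = z_seq.size(1)
        causal_mask = torch.triu(torch.full((T, T), float("-inf")), diagonal=1)
        h = self.transformer(z_seq, mask=causal_mask)
        return self.head(h)  # predict the next dynamics latent at each step

# Toy forward pass over a dummy 8-frame clip.
frames = torch.randn(1, 8, 3, 64, 64)
encoder, decoder, policy = DynamicsEncoder(), FrozenAppearanceDecoder(), LatentPolicy()
z = torch.stack([encoder(frames[:, t], frames[:, t + 1]) for t in range(7)], dim=1)
z_pred = policy(z)                                   # autoregressive latent prediction
next_frame = decoder(frames[:, -1], z_pred[:, -1])   # render appearance from the latent
```

Keeping the appearance model frozen is what makes the decoupling work in this sketch: since the decoder already accounts for visual detail, the latent codes are pushed to carry only the compact, task-relevant dynamics that the policy then models autoregressively.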