Vid2World: Crafting Video Diffusion Models to Interactive World Models
May 20, 2025
Authors: Siqiao Huang, Jialong Wu, Qixing Zhou, Shangchen Miao, Mingsheng Long
cs.AI
Abstract
World models, which predict transitions based on historical observation and
action sequences, have shown great promise in improving data efficiency for
sequential decision making. However, existing world models often require
extensive domain-specific training and still produce low-fidelity, coarse
predictions, limiting their applicability in complex environments. In contrast,
video diffusion models trained on large, internet-scale datasets have
demonstrated impressive capabilities in generating high-quality videos that
capture diverse real-world dynamics. In this work, we present Vid2World, a
general approach for leveraging and transferring pre-trained video diffusion
models into interactive world models. To bridge the gap, Vid2World performs
causalization of a pre-trained video diffusion model, crafting its
architecture and training objective to enable autoregressive generation.
Furthermore, it introduces a causal action guidance mechanism to enhance action
controllability in the resulting interactive world model. Extensive experiments
in robot manipulation and game simulation domains show that our method offers a
scalable and effective approach for repurposing highly capable video diffusion
models into interactive world models.