Vid2World: Crafting Video Diffusion Models to Interactive World Models
May 20, 2025
Authors: Siqiao Huang, Jialong Wu, Qixing Zhou, Shangchen Miao, Mingsheng Long
cs.AI
Abstract
World models, which predict transitions based on historical observation and
action sequences, have shown great promise in improving data efficiency for
sequential decision making. However, existing world models often require
extensive domain-specific training and still produce low-fidelity, coarse
predictions, limiting their applicability in complex environments. In contrast,
video diffusion models trained on large, internet-scale datasets have
demonstrated impressive capabilities in generating high-quality videos that
capture diverse real-world dynamics. In this work, we present Vid2World, a
general approach for leveraging and transferring pre-trained video diffusion
models into interactive world models. To bridge the gap, Vid2World performs
causalization of a pre-trained video diffusion model by crafting its
architecture and training objective to enable autoregressive generation.
Furthermore, it introduces a causal action guidance mechanism to enhance action
controllability in the resulting interactive world model. Extensive experiments
in robot manipulation and game simulation domains show that our method offers a
scalable and effective approach for repurposing highly capable video diffusion
models into interactive world models.
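The abstract names two mechanisms, causalization and causal action guidance, without implementation detail. As a rough, hypothetical sketch of how such mechanisms are commonly realized (not the paper's actual code), the PyTorch snippet below illustrates (a) temporal self-attention restricted by a lower-triangular mask, so each frame attends only to itself and earlier frames, and (b) per-frame action conditioning with independent action dropout, which enables classifier-free-guidance-style action control at sampling time. All class, function, and parameter names (CausalTemporalAttention, CausalActionGuidance, guided_noise_prediction, drop_prob, scale) are illustrative assumptions.

```python
import torch
import torch.nn as nn


class CausalTemporalAttention(nn.Module):
    """Temporal self-attention over the frame axis with a causal
    (lower-triangular) mask: frame t attends only to frames <= t."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch * spatial_tokens, frames, dim)
        t = x.size(1)
        # True entries are disallowed; mask out strictly-future frames.
        mask = torch.triu(
            torch.ones(t, t, dtype=torch.bool, device=x.device), diagonal=1
        )
        out, _ = self.attn(x, x, x, attn_mask=mask)
        return out


class CausalActionGuidance(nn.Module):
    """Per-frame action conditioning with independent action dropout,
    so the model learns both action-conditional and unconditional
    transitions at every step."""

    def __init__(self, action_dim: int, dim: int, drop_prob: float = 0.1):
        super().__init__()
        self.proj = nn.Linear(action_dim, dim)
        # Learned embedding standing in for "no action given".
        self.null_action = nn.Parameter(torch.zeros(dim))
        self.drop_prob = drop_prob

    def forward(self, actions: torch.Tensor) -> torch.Tensor:
        # actions: (batch, frames, action_dim)
        emb = self.proj(actions)
        if self.training:
            # Drop each frame's action independently (not the whole
            # sequence at once), keeping the conditioning causal per step.
            keep = torch.rand(emb.shape[:2], device=emb.device) > self.drop_prob
            emb = torch.where(keep.unsqueeze(-1), emb, self.null_action.expand_as(emb))
        return emb


def guided_noise_prediction(model, x_t, t, action_emb, null_emb, scale: float = 2.0):
    """Classifier-free guidance over actions at sampling time:
    eps = eps(x, null) + scale * (eps(x, action) - eps(x, null)).
    `model` is assumed to be a denoiser taking (noisy latents, timestep,
    per-frame action embeddings) and returning a noise estimate."""
    eps_cond = model(x_t, t, action_emb)
    eps_uncond = model(x_t, t, null_emb)
    return eps_uncond + scale * (eps_cond - eps_uncond)
```

Under these assumptions, dropping actions per frame rather than per sequence is what keeps the guidance causal: at inference, the guidance scale can steer each autoregressively generated frame toward its own conditioning action without revealing future actions.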