Vid2World: Crafting Video Diffusion Models to Interactive World Models
May 20, 2025
Authors: Siqiao Huang, Jialong Wu, Qixing Zhou, Shangchen Miao, Mingsheng Long
cs.AI
Abstract
World models, which predict transitions based on historical observation and
action sequences, have shown great promise in improving data efficiency for
sequential decision making. However, existing world models often require
extensive domain-specific training and still produce low-fidelity, coarse
predictions, limiting their applicability in complex environments. In contrast,
video diffusion models trained on large, internet-scale datasets have
demonstrated impressive capabilities in generating high-quality videos that
capture diverse real-world dynamics. In this work, we present Vid2World, a
general approach for leveraging and transferring pre-trained video diffusion
models into interactive world models. To bridge the gap, Vid2World performs
causalization of a pre-trained video diffusion model, crafting its
architecture and training objective to enable autoregressive generation.
Furthermore, it introduces a causal action guidance mechanism to enhance action
controllability in the resulting interactive world model. Extensive experiments
in robot manipulation and game simulation domains show that our method offers a
scalable and effective approach for repurposing highly capable video diffusion
models into interactive world models.