Video World Models with Long-term Spatial Memory
June 5, 2025
Authors: Tong Wu, Shuai Yang, Ryan Po, Yinghao Xu, Ziwei Liu, Dahua Lin, Gordon Wetzstein
cs.AI
Abstract
Emerging world models autoregressively generate video frames in response to
actions, such as camera movements and text prompts, among other control
signals. Due to limited temporal context window sizes, these models often
struggle to maintain scene consistency during revisits, leading to severe
forgetting of previously generated environments. Inspired by the mechanisms of
human memory, we introduce a novel framework to enhance the long-term consistency
of video world models through a geometry-grounded long-term spatial memory. Our
framework includes mechanisms to store and retrieve information from the
long-term spatial memory, and we curate custom datasets to train and evaluate
world models with explicitly stored 3D memory mechanisms. Our evaluations show
improved quality, consistency, and context length compared to relevant
baselines, paving the way towards long-term consistent world generation.
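To make the store-and-retrieve idea concrete, below is a minimal sketch of how a geometry-grounded spatial memory might be organized, assuming a colored point-cloud memory, a 4x4 world-to-camera pose, and a 3x3 pinhole intrinsics matrix K. The class name, method names, and the z-buffered point splatting used for retrieval are illustrative assumptions for exposition, not the paper's actual implementation.

```python
import numpy as np


class SpatialMemory:
    """Accumulates generated scene content as a global 3D point cloud (hypothetical)."""

    def __init__(self):
        self.points = np.empty((0, 3))  # world-space XYZ
        self.colors = np.empty((0, 3))  # RGB in [0, 1]

    def store(self, points, colors):
        """Fuse newly reconstructed 3D points into the long-term memory."""
        self.points = np.vstack([self.points, points])
        self.colors = np.vstack([self.colors, colors])

    def retrieve(self, world_to_cam, K, height, width):
        """Render the memory into a queried camera view by z-buffered point splatting."""
        canvas = np.zeros((height, width, 3))
        depth = np.full((height, width), np.inf)
        if len(self.points) == 0:
            return canvas
        # Transform memory points into the camera frame (4x4 world-to-camera pose).
        pts_h = np.concatenate([self.points, np.ones((len(self.points), 1))], axis=1)
        cam = (world_to_cam @ pts_h.T).T[:, :3]
        front = cam[:, 2] > 1e-6  # keep only points in front of the camera
        cam, cols = cam[front], self.colors[front]
        # Pinhole projection with intrinsics K (3x3).
        uv = (K @ cam.T).T
        u = (uv[:, 0] / uv[:, 2]).astype(int)
        v = (uv[:, 1] / uv[:, 2]).astype(int)
        ok = (u >= 0) & (u < width) & (v >= 0) & (v < height)
        # Nearest point wins at each pixel (z-buffer).
        for ui, vi, zi, ci in zip(u[ok], v[ok], cam[ok, 2], cols[ok]):
            if zi < depth[vi, ui]:
                depth[vi, ui] = zi
                canvas[vi, ui] = ci
        return canvas
```

In an autoregressive loop, `retrieve` would supply a rendering of previously generated geometry at the current camera pose to condition next-frame generation alongside the model's short-term context window, and 3D points reconstructed from each new frame (e.g., via estimated depth) would be fused back through `store`, so revisited regions stay consistent beyond the temporal context limit.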