Latent Spatial Memory for Video World Models

Abstract

Video world models that maintain 3D spatial consistency across generated frames typically rely on explicit point cloud memory constructed in RGB space. This design is both computationally expensive, requiring repeated rendering and VAE encoding, and inherently lossy, as the round trip through pixel space discards rich features of the learned latent representation. In this paper, we introduce latent spatial memory for video world models, a persistent 3D cache that stores scene information directly in the diffusion latent space, avoiding pixel-space reconstruction. Building on this, we propose Mirage, a latent-space spatial memory framework that constructs the memory by lifting latent tokens into 3D via depth-guided back-projection and queries it by synthesizing novel views through direct latent-space warping. This unified formulation eliminates both the information loss of pixel-space reconstruction and the computational burden of repeated encoding and rendering. Experiments show that latent spatial memory achieves up to 10.57× faster end-to-end video generation and a 55× reduction in memory footprint relative to explicit 3D baselines. Leveraging the geometric prior of the diffusion model, Mirage attains state-of-the-art performance on WorldScore and strong reconstruction quality on RealEstate10K.
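The construct-and-query loop described above (lift latent tokens into 3D with depth, then warp them into a new view) can be illustrated with a minimal NumPy sketch. This is not the Mirage implementation. The function names `backproject_latents` and `warp_latents`, the pinhole camera model, the assumption that depth and intrinsics are given at latent-grid resolution, and the nearest-point z-buffer splatting are all hypothetical choices made for illustration only.

```python
import numpy as np


def backproject_latents(latents, depth, K, cam_to_world):
    """Lift an (h, w, c) latent grid to world-space points carrying features.

    Assumptions (illustrative): depth is (h, w) at latent resolution, K is a
    3x3 pinhole intrinsics matrix scaled to the latent grid, and cam_to_world
    is a 4x4 rigid pose.
    """
    h, w, c = latents.shape
    v, u = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    # Homogeneous coordinates of each latent token center.
    pix = np.stack([u + 0.5, v + 0.5, np.ones((h, w))], axis=-1).reshape(-1, 3)
    rays = pix @ np.linalg.inv(K).T          # camera-space rays with z = 1
    pts_cam = rays * depth.reshape(-1, 1)    # scale each ray by its depth
    pts_h = np.concatenate([pts_cam, np.ones((h * w, 1))], axis=1)
    pts_world = (pts_h @ cam_to_world.T)[:, :3]
    return pts_world, latents.reshape(-1, c)


def warp_latents(points, feats, K, world_to_cam, h, w):
    """Project stored latent points into a target view of size (h, w).

    Returns the warped (h, w, c) latent grid and a boolean mask of cells that
    received a point. When several points land in one cell, the nearest wins.
    """
    c = feats.shape[1]
    pts_h = np.concatenate([points, np.ones((len(points), 1))], axis=1)
    pts_cam = (pts_h @ world_to_cam.T)[:, :3]
    front = pts_cam[:, 2] > 1e-6             # keep points in front of the camera
    pts_cam, f = pts_cam[front], feats[front]
    proj = pts_cam @ K.T
    u = np.floor(proj[:, 0] / proj[:, 2]).astype(int)
    v = np.floor(proj[:, 1] / proj[:, 2]).astype(int)
    z = pts_cam[:, 2]
    inside = (u >= 0) & (u < w) & (v >= 0) & (v < h)
    u, v, z, f = u[inside], v[inside], z[inside], f[inside]

    # Z-buffer: sort by cell index, then by depth, and keep the first
    # (nearest) point per cell.
    flat = v * w + u
    order = np.lexsort((z, flat))
    cells, first = np.unique(flat[order], return_index=True)
    nearest = order[first]

    out = np.zeros((h * w, c))
    mask = np.zeros(h * w, dtype=bool)
    out[cells] = f[nearest]
    mask[cells] = True
    return out.reshape(h, w, c), mask.reshape(h, w)
```

In this sketch the memory is just the pair `(points, feats)`, so a query never decodes to pixels or re-encodes through a VAE. Cells where `mask` is false are regions the memory has not observed; in a generative pipeline those would be left for the diffusion model to fill in.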