Latent Spatial Memory for Video World Models

Abstract

Video world models that maintain 3D spatial consistency across generated frames typically rely on explicit point cloud memory constructed in RGB space. This design is computationally expensive, because it requires repeated rendering and VAE encoding, and inherently lossy, because the round trip through pixel space discards rich features of the learned latent representation. In this paper, we introduce latent spatial memory for video world models: a persistent 3D cache that stores scene information directly in the diffusion latent space and avoids pixel-space reconstruction. Building on this, we propose Mirage, a latent spatial memory framework that constructs the memory by lifting latent tokens into 3D via depth-guided back-projection and queries it by synthesizing novel views through direct latent-space warping. This unified formulation eliminates both the information loss of pixel-space reconstruction and the computational burden of repeated encoding and rendering. Experiments show that latent spatial memory achieves up to 10.57× faster end-to-end video generation and a 55× reduction in memory footprint relative to explicit 3D baselines. Leveraging the geometric prior of the diffusion model, Mirage attains state-of-the-art performance on WorldScore and strong reconstruction quality on RealEstate10K.
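
To make the two memory operations concrete, the following is a minimal NumPy sketch of depth-guided back-projection of a latent grid and of querying the resulting cache by warping it into a target view. It is an illustration under stated assumptions, not Mirage's implementation: the abstract does not specify the camera model, the depth source, or the splatting rule, so the sketch assumes a pinhole camera with intrinsics scaled to the latent resolution, one depth value per latent token, and nearest-depth selection per target cell. The function names `backproject_latents` and `warp_to_view` are hypothetical.

```python
import numpy as np


def backproject_latents(latents, depth, K, cam_to_world):
    """Lift an (H, W, C) latent grid to one world-space point per token.

    Assumed inputs: depth is (H, W) at latent resolution, K is a 3x3 pinhole
    intrinsics matrix scaled to the latent grid, cam_to_world is a 4x4 pose.
    """
    H, W, C = latents.shape
    v, u = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    # Homogeneous coordinates of token centres (u + 0.5, v + 0.5, 1).
    pix = np.stack([u + 0.5, v + 0.5, np.ones((H, W))], axis=-1).reshape(-1, 3)
    rays = pix @ np.linalg.inv(K).T            # camera-space rays with z = 1
    pts_cam = rays * depth.reshape(-1, 1)      # scale each ray by its depth
    pts_world = pts_cam @ cam_to_world[:3, :3].T + cam_to_world[:3, 3]
    return pts_world, latents.reshape(-1, C)


def warp_to_view(points, feats, K, world_to_cam, H, W):
    """Project cached tokens into a target view; the nearest point wins a cell."""
    C = feats.shape[1]
    pts_cam = points @ world_to_cam[:3, :3].T + world_to_cam[:3, 3]
    front = pts_cam[:, 2] > 1e-6               # drop points behind the camera
    pts_cam, feats = pts_cam[front], feats[front]
    proj = pts_cam @ K.T
    u = np.floor(proj[:, 0] / proj[:, 2]).astype(int)
    v = np.floor(proj[:, 1] / proj[:, 2]).astype(int)
    inside = (u >= 0) & (u < W) & (v >= 0) & (v < H)
    cell = v[inside] * W + u[inside]
    z, f = pts_cam[inside, 2], feats[inside]
    # Sort by cell, then by depth, and keep the first (nearest) entry per cell.
    order = np.lexsort((z, cell))
    sorted_cell = cell[order]
    first = np.ones(len(order), dtype=bool)
    first[1:] = sorted_cell[1:] != sorted_cell[:-1]
    winners = order[first]
    out = np.zeros((H * W, C), dtype=feats.dtype)
    mask = np.zeros(H * W, dtype=bool)
    out[cell[winners]] = f[winners]
    mask[cell[winners]] = True                 # False marks cells with no cached token
    return out.reshape(H, W, C), mask.reshape(H, W)
```

In this sketch the cache holds latent vectors rather than RGB colors, so a query returns a latent grid directly, with no rendering to pixels and no VAE re-encoding. The returned mask marks target cells that no cached token reaches; how such holes are filled is left open here.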