Spatia: Video Generation with Updatable Spatial Memory

Abstract

Existing video generation models struggle to maintain long-term spatial and temporal consistency because video signals are dense and high-dimensional. To overcome this limitation, we propose Spatia, a spatial memory-aware video generation framework that explicitly maintains a 3D scene point cloud as persistent spatial memory. Spatia iteratively generates video clips conditioned on this spatial memory and continuously updates the memory through visual SLAM. This dynamic-static disentanglement design improves spatial consistency throughout the generation process while preserving the model's ability to produce realistic dynamic entities. Furthermore, Spatia enables applications such as explicit camera control and 3D-aware interactive editing, providing a geometrically grounded framework for scalable, memory-driven video generation.
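
The generate-then-update loop described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the names SpatialMemory, generate_clip, estimate_static_points, and generate_video are hypothetical, the video model and the visual SLAM step are replaced by trivial stand-ins, and the voxel-grid deduplication is only one assumed way to keep a point cloud compact. The sketch shows the control flow only: each clip is conditioned on the current memory, and only static scene points are written back.

```python
import random
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

Point = Tuple[float, float, float]


@dataclass
class SpatialMemory:
    """Hypothetical persistent memory: a static-scene point cloud on a voxel grid."""
    voxel_size: float = 0.1
    voxels: Dict[Tuple[int, int, int], Point] = field(default_factory=dict)

    def update(self, points: List[Point]) -> None:
        """Merge new points, keeping the first point seen in each voxel."""
        for p in points:
            key = tuple(int(c // self.voxel_size) for c in p)
            self.voxels.setdefault(key, p)

    def __len__(self) -> int:
        return len(self.voxels)


def generate_clip(memory: SpatialMemory, camera_path: List[Point]) -> List[dict]:
    """Stand-in for the video model (assumption): one 'frame' per camera pose.

    A real model would condition on the memory as seen along the camera path;
    here each frame only records how many memory points were available.
    """
    return [{"pose": pose, "visible_points": len(memory)} for pose in camera_path]


def estimate_static_points(clip: List[dict], rng: random.Random,
                           points_per_frame: int = 50) -> List[Point]:
    """Stand-in for visual SLAM (assumption): random 3D points near each pose."""
    points = []
    for frame in clip:
        x, y, z = frame["pose"]
        for _ in range(points_per_frame):
            points.append((x + rng.uniform(-1, 1),
                           y + rng.uniform(-1, 1),
                           z + rng.uniform(-1, 1)))
    return points


def generate_video(num_clips: int, frames_per_clip: int = 4, seed: int = 0):
    """Iteratively generate clips, updating the spatial memory after each one."""
    rng = random.Random(seed)
    memory = SpatialMemory()
    video: List[dict] = []
    for i in range(num_clips):
        # Explicit camera control: the caller chooses the path for each clip.
        camera_path = [(float(i * frames_per_clip + t), 0.0, 0.0)
                       for t in range(frames_per_clip)]
        clip = generate_clip(memory, camera_path)           # condition on memory
        memory.update(estimate_static_points(clip, rng))    # write back static points
        video.extend(clip)
    return video, memory
```

In this toy version the first clip sees an empty memory and every later clip sees the points accumulated so far, which mirrors the iterative conditioning the abstract describes. Dynamic entities are deliberately absent from the memory; in the framework they would be produced by the video model itself.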