VMem: Consistent Interactive Video Scene Generation with Surfel-Indexed View Memory
June 23, 2025
Authors: Runjia Li, Philip Torr, Andrea Vedaldi, Tomas Jakab
cs.AI
Abstract
We propose a novel memory mechanism to build video generators that can
explore environments interactively. Similar results have previously been
achieved by out-painting 2D views of the scene while incrementally
reconstructing its 3D geometry, which quickly accumulates errors, or by video
generators with a short context window, which struggle to maintain scene
coherence over the long term. To address these limitations, we introduce
Surfel-Indexed View Memory (VMem), a mechanism that remembers past views by
indexing them geometrically based on the 3D surface elements (surfels) they
have observed. VMem enables the efficient retrieval of the most relevant past
views when generating new ones. By focusing only on these relevant views, our
method produces consistent explorations of imagined environments at a fraction
of the computational cost of using all past views as context. We evaluate our
approach on challenging long-term scene synthesis benchmarks and demonstrate
superior performance compared to existing methods in maintaining scene
coherence and camera control.
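
To make the indexing idea concrete, below is a minimal sketch of how a surfel-indexed view memory could be organized, assuming surfels are represented as 3D points quantized into voxel-sized bins and past views are retrieved by counting how many surfels they share with the target view. The class name `SurfelViewMemory`, its methods, and the vote-based retrieval are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of a surfel-indexed view memory (not the paper's code).
# Each stored view is indexed by the surfels it observed; retrieval returns the
# past views that share the most surfels with the target view.
from collections import defaultdict

import numpy as np


class SurfelViewMemory:
    def __init__(self, voxel_size: float = 0.1):
        self.voxel_size = voxel_size
        # Maps a quantized surfel position to the ids of views that observed it.
        self.surfel_to_views: dict[tuple, set[int]] = defaultdict(set)
        self.views: list[dict] = []  # each entry holds an image and its camera pose

    def _key(self, point: np.ndarray) -> tuple:
        # Quantize a 3D surfel position into a discrete voxel key.
        return tuple(np.floor(point / self.voxel_size).astype(int))

    def add_view(self, image, pose: np.ndarray, surfels: np.ndarray) -> int:
        """Store a view and index it by the (N, 3) surfel positions it observed."""
        view_id = len(self.views)
        self.views.append({"image": image, "pose": pose})
        for p in surfels:
            self.surfel_to_views[self._key(p)].add(view_id)
        return view_id

    def retrieve(self, target_surfels: np.ndarray, k: int = 4) -> list[int]:
        """Return ids of the k past views that best cover the target surfels."""
        votes: dict[int, int] = defaultdict(int)
        for p in target_surfels:
            for view_id in self.surfel_to_views.get(self._key(p), ()):
                votes[view_id] += 1
        return sorted(votes, key=votes.get, reverse=True)[:k]
```

In such a scheme, the generator would condition only on the retrieved views rather than on the full history, which is what keeps the context size, and hence the computational cost, bounded as the explored scene grows.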