AnchorWeave: World-Consistent Video Generation with Retrieved Local Spatial Memories
February 16, 2026
Authors: Zun Wang, Han Lin, Jaehong Yoon, Jaemin Cho, Yue Zhang, Mohit Bansal
cs.AI
Abstract
Maintaining spatial world consistency over long horizons remains a central challenge for camera-controllable video generation. Existing memory-based approaches often condition generation on a globally reconstructed 3D scene, rendering anchor videos from geometry reconstructed over the generation history. However, reconstructing a global 3D scene from multiple views inevitably introduces cross-view misalignment: pose and depth estimation errors cause the same surfaces to be reconstructed at slightly different 3D locations across views. When fused, these inconsistencies accumulate into noisy geometry that contaminates the conditioning signals and degrades generation quality. We introduce AnchorWeave, a memory-augmented video generation framework that replaces a single misaligned global memory with multiple clean local geometric memories and learns to reconcile their cross-view inconsistencies. To this end, AnchorWeave performs coverage-driven local memory retrieval aligned with the target trajectory and integrates the selected local memories through a multi-anchor weaving controller during generation. Extensive experiments demonstrate that AnchorWeave significantly improves long-term scene consistency while maintaining strong visual quality, with ablation and analysis studies further validating the effectiveness of local geometric conditioning, multi-anchor control, and coverage-driven retrieval.
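The abstract does not specify how coverage-driven retrieval is implemented; the sketch below is a minimal illustration of one plausible reading, assuming each local memory exposes a point cloud and that coverage is measured by proximity of memory points to 3D points sampled along the target trajectory's view frustums. The function name, the `radius` threshold, the budget `k`, and the greedy set-cover formulation are all illustrative assumptions, not details from the paper.

```python
import numpy as np

def coverage_driven_retrieval(target_points, local_memories, k=4, radius=0.1):
    """Hypothetical greedy sketch of coverage-driven local memory retrieval.

    target_points: (N, 3) array of 3D points sampled along the target
        camera trajectory's view frustums.
    local_memories: list of dicts, each holding a 'points' (M, 3) array
        of the 3D points covered by that local geometric memory.
    Returns indices of up to k memories that jointly cover the most
    target points (classic greedy set cover).
    """
    def covered(mem_points):
        # A target point counts as covered if any memory point lies
        # within `radius` of it (brute-force nearest-neighbor test).
        d = np.linalg.norm(
            target_points[:, None, :] - mem_points[None, :, :], axis=-1
        )
        return d.min(axis=1) < radius

    remaining = np.ones(len(target_points), dtype=bool)  # uncovered targets
    selected = []
    for _ in range(k):
        best, best_gain = None, 0
        for i, mem in enumerate(local_memories):
            if i in selected:
                continue
            # Marginal gain: newly covered target points this memory adds.
            gain = int((covered(mem["points"]) & remaining).sum())
            if gain > best_gain:
                best, best_gain = i, gain
        if best is None:  # no remaining memory adds coverage
            break
        selected.append(best)
        remaining &= ~covered(local_memories[best]["points"])
    return selected
```

Under these assumptions, the selected memories would then be rendered into anchor videos and fused by the multi-anchor weaving controller; the paper's actual retrieval criterion and memory representation may differ.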