

AnchorWeave: World-Consistent Video Generation with Retrieved Local Spatial Memories

February 16, 2026
Authors: Zun Wang, Han Lin, Jaehong Yoon, Jaemin Cho, Yue Zhang, Mohit Bansal
cs.AI

Abstract

Maintaining spatial world consistency over long horizons remains a central challenge for camera-controllable video generation. Existing memory-based approaches often condition generation on globally reconstructed 3D scenes by rendering anchor videos from the reconstructed geometry in the history. However, reconstructing a global 3D scene from multiple views inevitably introduces cross-view misalignment, as pose and depth estimation errors cause the same surfaces to be reconstructed at slightly different 3D locations across views. When fused, these inconsistencies accumulate into noisy geometry that contaminates the conditioning signals and degrades generation quality. We introduce AnchorWeave, a memory-augmented video generation framework that replaces a single misaligned global memory with multiple clean local geometric memories and learns to reconcile their cross-view inconsistencies. To this end, AnchorWeave performs coverage-driven local memory retrieval aligned with the target trajectory and integrates the selected local memories through a multi-anchor weaving controller during generation. Extensive experiments demonstrate that AnchorWeave significantly improves long-term scene consistency while maintaining strong visual quality, with ablation and analysis studies further validating the effectiveness of local geometric conditioning, multi-anchor control, and coverage-driven retrieval.
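The coverage-driven local memory retrieval described in the abstract can be pictured as a greedy set-cover over the views of the target trajectory: repeatedly pick the local memory that covers the most still-uncovered target views. This is a minimal illustrative sketch, not the paper's actual method; the `memory_coverage` mapping (which target views each memory's geometry can render, e.g. via frustum overlap) and the budget `k` are assumptions introduced here:

```python
def greedy_coverage_retrieval(memory_coverage, num_target_views, k):
    """Greedily select up to k local memories to cover the target trajectory.

    memory_coverage: dict mapping a memory id to the set of target-view
        indices that memory's local geometry can render (a hypothetical
        proxy for frustum overlap with the target camera trajectory).
    Returns the selected memory ids (in pick order) and any views left
    uncovered.
    """
    uncovered = set(range(num_target_views))
    selected = []
    for _ in range(k):
        # Pick the memory covering the most still-uncovered target views.
        best = max(
            memory_coverage,
            key=lambda m: len(memory_coverage[m] & uncovered),
            default=None,
        )
        if best is None or not (memory_coverage[best] & uncovered):
            break  # no remaining memory adds new coverage
        selected.append(best)
        uncovered -= memory_coverage[best]
    return selected, uncovered


# Example: three local memories, six target views, budget of three.
coverage = {"A": {0, 1, 2}, "B": {2, 3}, "C": {4, 5}}
picked, missed = greedy_coverage_retrieval(coverage, 6, 3)
# picked == ["A", "C", "B"], missed == set()
```

The greedy choice here is the standard approximation for set cover; the selected memories would then be handed to the multi-anchor weaving controller, which the paper describes as reconciling their cross-view inconsistencies during generation.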
PDF · February 18, 2026