AnyRecon: Arbitrary-View 3D Reconstruction with Video Diffusion Model
April 21, 2026
Authors: Yutian Chen, Shi Guo, Renbiao Jin, Tianshuo Yang, Xin Cai, Yawen Luo, Mingxin Yang, Mulin Yu, Linning Xu, Tianfan Xue
cs.AI
Abstract
Sparse-view 3D reconstruction is essential for modeling scenes from casual captures, but it remains challenging for non-generative reconstruction methods. Existing diffusion-based approaches mitigate this issue by synthesizing novel views, but they often condition on only one or two capture frames, which restricts geometric consistency and limits scalability to large or diverse scenes. We propose AnyRecon, a scalable framework for reconstruction from arbitrary and unordered sparse inputs that preserves explicit geometric control while supporting flexible conditioning cardinality. To support long-range conditioning, our method constructs a persistent global scene memory via a prepended capture-view cache and removes temporal compression to maintain frame-level correspondence under large viewpoint changes. Beyond improving the generative model, we find that the interplay between generation and reconstruction is crucial for large-scale 3D scenes. We therefore introduce a geometry-aware conditioning strategy that couples generation and reconstruction through an explicit 3D geometric memory and geometry-driven capture-view retrieval. To ensure efficiency, we combine 4-step diffusion distillation with context-window sparse attention, reducing attention cost from quadratic to linear. Extensive experiments demonstrate robust and scalable reconstruction across irregular inputs, large viewpoint gaps, and long trajectories.
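The abstract mentions geometry-driven capture-view retrieval but does not specify how views are scored. The sketch below is a minimal illustration of one plausible reading: rank capture views by pose proximity to the target view and keep the top-k as conditioning frames. The function name, the translation/rotation weights `w_trans`/`w_rot`, and the scoring formula are all illustrative assumptions, not details from the paper.

```python
import numpy as np

def retrieve_capture_views(capture_poses, target_pose, k=4, w_trans=1.0, w_rot=1.0):
    """Hypothetical pose-proximity retrieval: return indices of the k
    capture views geometrically closest to a target pose.

    capture_poses: (N, 4, 4) camera-to-world matrices of the capture views.
    target_pose:   (4, 4) camera-to-world matrix of the view to generate.
    The weights mixing translation and rotation are illustrative knobs.
    """
    t_target = target_pose[:3, 3]
    R_target = target_pose[:3, :3]

    scores = []
    for pose in capture_poses:
        # Euclidean distance between camera centers.
        d_trans = np.linalg.norm(pose[:3, 3] - t_target)
        # Geodesic angle between the two rotations: arccos((tr(R_rel) - 1) / 2).
        R_rel = R_target.T @ pose[:3, :3]
        cos = np.clip((np.trace(R_rel) - 1.0) / 2.0, -1.0, 1.0)
        d_rot = np.arccos(cos)
        scores.append(w_trans * d_trans + w_rot * d_rot)

    order = np.argsort(scores)  # ascending: geometrically closest views first
    return order[:k]
```

A full system would likely score by frustum or surface overlap against the 3D geometric memory rather than raw pose distance; the pose-based score above is only the simplest stand-in.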
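To make the quadratic-to-linear claim concrete, here is a PyTorch sketch of one way a context-window sparse attention pattern can bound per-query cost: each frame attends only to frames within a fixed window of itself, plus the prepended capture-view cache, which stays globally visible. With a constant window and cache size, the number of attended keys per query is constant, so total attention cost grows linearly in the number of frames. The mask layout and all names (`window_sparse_attention`, `tokens_per_frame`, `n_cache_frames`) are assumptions about the pattern, not the paper's implementation; for clarity the sketch applies a dense boolean mask, whereas an actually linear-cost implementation would use block-sparse kernels.

```python
import torch
import torch.nn.functional as F

def window_sparse_attention(q, k, v, tokens_per_frame, window=2, n_cache_frames=1):
    """Illustrative frame-level context-window sparse attention.

    q, k, v: (B, H, T, D) with T = n_frames * tokens_per_frame. The first
    n_cache_frames frames are the prepended capture-view cache; every query
    may attend to them. Each frame additionally attends only to frames
    within `window` of itself.
    """
    T = q.shape[2]
    assert T % tokens_per_frame == 0, "sequence must be a whole number of frames"
    frame_id = torch.arange(T, device=q.device) // tokens_per_frame  # (T,)

    # Cache-frame keys are visible to all queries.
    is_cache = frame_id < n_cache_frames
    # Local band: queries see keys whose frame index is within the window.
    local = (frame_id[:, None] - frame_id[None, :]).abs() <= window
    # Boolean mask, True = attend; broadcast over batch and heads.
    mask = local | is_cache[None, :]
    return F.scaled_dot_product_attention(q, k, v, attn_mask=mask)
```

The diagonal is always unmasked (a frame sees itself), so every query row has at least one valid key and the softmax is well defined.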