
AnyRecon: Arbitrary-View 3D Reconstruction with Video Diffusion Model

April 21, 2026
Authors: Yutian Chen, Shi Guo, Renbiao Jin, Tianshuo Yang, Xin Cai, Yawen Luo, Mingxin Yang, Mulin Yu, Linning Xu, Tianfan Xue
cs.AI

Abstract

Sparse-view 3D reconstruction is essential for modeling scenes from casual captures, but it remains challenging for non-generative reconstruction methods. Existing diffusion-based approaches mitigate this issue by synthesizing novel views, but they often condition on only one or two capture frames, which restricts geometric consistency and limits scalability to large or diverse scenes. We propose AnyRecon, a scalable framework for reconstruction from arbitrary, unordered sparse inputs that preserves explicit geometric control while supporting a flexible number of conditioning views. To support long-range conditioning, our method constructs a persistent global scene memory via a prepended capture-view cache and removes temporal compression to maintain frame-level correspondence under large viewpoint changes. Beyond improving the generative model, we find that the interplay between generation and reconstruction is crucial for large-scale 3D scenes. We therefore introduce a geometry-aware conditioning strategy that couples generation and reconstruction through an explicit 3D geometric memory and geometry-driven capture-view retrieval. To ensure efficiency, we combine 4-step diffusion distillation with context-window sparse attention, reducing the quadratic attention cost. Extensive experiments demonstrate robust and scalable reconstruction across irregular inputs, large viewpoint gaps, and long trajectories.
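The abstract does not spell out how capture views are retrieved from the cache. As a rough illustration only, the sketch below scores cached views by a simple geometric heuristic (camera-center distance plus viewing-direction disagreement) and returns the top-k; the function name, scoring rule, and pose conventions are all assumptions, not the paper's actual method.

```python
# Hypothetical sketch of geometry-driven capture-view retrieval, loosely
# following the abstract's description. The scoring heuristic and all
# names here are illustrative assumptions.
import numpy as np

def retrieve_capture_views(target_pose, capture_poses, k=4):
    """Pick the k cached capture views geometrically closest to the target.

    target_pose:   4x4 camera-to-world matrix of the view to generate.
    capture_poses: list of 4x4 camera-to-world matrices for cached captures.

    The score combines camera-center distance with viewing-direction
    agreement, so retrieved views both sit near the target camera and
    face a similar direction.
    """
    t_pos, t_dir = target_pose[:3, 3], target_pose[:3, 2]
    scores = []
    for pose in capture_poses:
        pos, view_dir = pose[:3, 3], pose[:3, 2]
        dist = np.linalg.norm(pos - t_pos)      # translation gap
        angle = 1.0 - float(view_dir @ t_dir)   # orientation gap in [0, 2]
        scores.append(dist + angle)             # smaller is better
    order = np.argsort(scores)
    return [int(i) for i in order[:k]]          # indices into the cache
```

In the paper's framing the retrieval is driven by the explicit 3D geometric memory rather than camera poses alone, so a faithful implementation would likely score views by overlap with reconstructed geometry; the pose-based proxy above is just the simplest stand-in.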
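Similarly, the "context-window sparse attention" is only named, not specified. A minimal sketch of one plausible reading is below: every generated frame attends to all prepended capture-cache tokens (the global scene memory) plus a local window of neighboring frames, instead of the full quadratic token set. The token layout, window size, and fully connected cache are assumptions.

```python
# Hypothetical sketch of a context-window sparse attention mask consistent
# with the abstract: queries attend to the prepended capture-view cache
# (global memory) and to a local window of frames. Layout is assumed.
import numpy as np

def sparse_attention_mask(n_memory, n_frames, tokens_per_frame, window=2):
    """Boolean mask: True where attention is allowed.

    Token layout: [n_memory cache tokens | n_frames * tokens_per_frame].
    Each frame attends to all cache tokens and to frames within `window`.
    """
    n = n_memory + n_frames * tokens_per_frame
    mask = np.zeros((n, n), dtype=bool)
    mask[:, :n_memory] = True   # every token reads the global cache
    mask[:n_memory, :] = True   # cache tokens stay fully connected (assumed)
    for i in range(n_frames):
        q0 = n_memory + i * tokens_per_frame
        lo = n_memory + max(0, i - window) * tokens_per_frame
        hi = n_memory + min(n_frames, i + window + 1) * tokens_per_frame
        mask[q0:q0 + tokens_per_frame, lo:hi] = True
    return mask
```

Under this layout the per-query cost grows with the cache size plus the window width rather than with the full sequence length, which matches the abstract's claim of reducing the quadratic attention cost.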