GeometryCrafter: Consistent Geometry Estimation for Open-world Videos with Diffusion Priors

Abstract

Despite remarkable advances in video depth estimation, existing methods are inherently limited in geometric fidelity because they produce affine-invariant predictions, which restricts their use in reconstruction and other metrically grounded downstream tasks. We propose GeometryCrafter, a novel framework that recovers high-fidelity, temporally coherent point map sequences from open-world videos, enabling accurate 3D/4D reconstruction, camera parameter estimation, and other depth-based applications. At the core of our approach is a point map Variational Autoencoder (VAE) that learns a latent space agnostic to video latent distributions, allowing effective point map encoding and decoding. Building on this VAE, we train a video diffusion model to capture the distribution of point map sequences conditioned on the input videos. Extensive evaluations on diverse datasets demonstrate that GeometryCrafter achieves state-of-the-art 3D accuracy, temporal consistency, and generalization capability.
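
As an illustration of why affine-invariant predictions limit metric use: such a prediction is defined only up to an unknown scale and shift, so it must be aligned to reference depth before it can serve a metrically grounded task. The sketch below fits that scale and shift by least squares. The helper name `align_scale_shift` and the toy values are assumptions made for illustration; this is not part of GeometryCrafter.

```python
def align_scale_shift(pred, ref):
    """Least-squares fit of scale s and shift t so that s * pred + t ~= ref.

    Hypothetical helper for illustration only; both inputs are flat
    sequences of depth values at the same pixels.
    """
    n = len(pred)
    if n != len(ref) or n < 2:
        raise ValueError("need two equally sized sequences with at least 2 samples")
    mean_p = sum(pred) / n
    mean_r = sum(ref) / n
    var_p = sum((p - mean_p) ** 2 for p in pred)
    if var_p == 0:
        raise ValueError("prediction is constant; scale is undetermined")
    cov = sum((p - mean_p) * (r - mean_r) for p, r in zip(pred, ref))
    s = cov / var_p            # closed-form slope of the 1D linear fit
    t = mean_r - s * mean_p    # intercept that matches the two means
    return s, t


# Toy data: the "prediction" equals the reference up to scale 2 and shift 0.5.
ref = [1.0, 2.0, 4.0, 8.0]
pred = [(r - 0.5) / 2.0 for r in ref]

s, t = align_scale_shift(pred, ref)
aligned = [s * p + t for p in pred]
```

Without reference depth, `s` and `t` cannot be recovered, and an unknown shift in particular distorts the shape of any geometry unprojected from the depth values.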