Zero-Shot Novel View and Depth Synthesis with Multi-View Geometric Diffusion
January 30, 2025
Authors: Vitor Guizilini, Muhammad Zubair Irshad, Dian Chen, Greg Shakhnarovich, Rares Ambrus
cs.AI
Abstract
Current methods for 3D scene reconstruction from sparse posed images employ
intermediate 3D representations such as neural fields, voxel grids, or 3D
Gaussians, to achieve multi-view consistent scene appearance and geometry. In
this paper we introduce MVGD, a diffusion-based architecture capable of direct
pixel-level generation of images and depth maps from novel viewpoints, given an
arbitrary number of input views. Our method uses raymap conditioning both to
augment visual features with spatial information from different viewpoints and
to guide the generation of images and depth maps from novel views. A
key aspect of our approach is the multi-task generation of images and depth
maps, using learnable task embeddings to guide the diffusion process towards
specific modalities. We train this model on a collection of more than 60
million multi-view samples from publicly available datasets, and propose
techniques to enable efficient and consistent learning in such diverse
conditions. We also propose a novel strategy that enables the efficient
training of larger models by incrementally fine-tuning smaller ones, with
promising scaling behavior. Through extensive experiments, we report
state-of-the-art results in multiple novel view synthesis benchmarks, as well
as multi-view stereo and video depth estimation.
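
To make the raymap-conditioning and task-embedding ideas concrete, here is a minimal PyTorch sketch, assuming a per-pixel raymap of concatenated ray origins and directions and a toy convolutional denoiser; the names `build_raymap` and `MVGDDenoiser` are illustrative and are not from the released MVGD code.

```python
# Hypothetical sketch of raymap conditioning and learnable task embeddings.
# Not the authors' implementation; names and shapes are illustrative only.
import torch
import torch.nn as nn


def build_raymap(K: torch.Tensor, cam_to_world: torch.Tensor, h: int, w: int) -> torch.Tensor:
    """Per-pixel ray origins and directions, shape (B, 6, H, W), from intrinsics
    K (B, 3, 3) and camera-to-world extrinsics (B, 4, 4)."""
    b = K.shape[0]
    ys, xs = torch.meshgrid(
        torch.arange(h, dtype=K.dtype), torch.arange(w, dtype=K.dtype), indexing="ij"
    )
    pix = torch.stack([xs + 0.5, ys + 0.5, torch.ones_like(xs)], dim=-1)      # (H, W, 3)
    dirs_cam = torch.einsum("bij,hwj->bhwi", torch.inverse(K), pix)           # unproject pixels
    dirs = torch.einsum("bij,bhwj->bhwi", cam_to_world[:, :3, :3], dirs_cam)  # rotate to world
    dirs = dirs / dirs.norm(dim=-1, keepdim=True)
    origins = cam_to_world[:, :3, 3].view(b, 1, 1, 3).expand_as(dirs)
    return torch.cat([origins, dirs], dim=-1).permute(0, 3, 1, 2)             # (B, 6, H, W)


class MVGDDenoiser(nn.Module):
    """Toy pixel-level denoiser conditioned on a raymap and a task embedding
    (task 0 = image, task 1 = depth)."""

    def __init__(self, dim: int = 64, num_tasks: int = 2):
        super().__init__()
        self.task_embed = nn.Embedding(num_tasks, dim)
        self.encode = nn.Conv2d(3 + 1 + 6, dim, 3, padding=1)    # noisy RGB + depth + raymap
        self.block = nn.Sequential(nn.GELU(), nn.Conv2d(dim, dim, 3, padding=1))
        self.decode = nn.Conv2d(dim, 4, 3, padding=1)             # denoised RGB + depth

    def forward(self, noisy: torch.Tensor, raymap: torch.Tensor, task_id: torch.Tensor):
        x = self.encode(torch.cat([noisy, raymap], dim=1))
        x = x + self.task_embed(task_id)[:, :, None, None]        # steer toward image or depth
        return self.decode(self.block(x))


# Shape check: a single target view at 32x32 resolution, depth task selected.
raymap = build_raymap(torch.eye(3)[None], torch.eye(4)[None], 32, 32)
out = MVGDDenoiser()(torch.randn(1, 4, 32, 32), raymap, torch.tensor([1]))    # (1, 4, 32, 32)
```

In the paper the conditioning also covers the raymaps and visual features of the input views; the sketch only shows how a target-view raymap and a task embedding can be injected into a single denoising step.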
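The incremental fine-tuning strategy for scaling can be pictured as initializing a deeper model from a shallower, already-trained one and then continuing training. The sketch below is a hypothetical illustration of that general idea, with made-up helpers (`make_transformer`, `grow_from`); it is not the authors' training code.

```python
# Hypothetical sketch of incrementally fine-tuning a larger model from a smaller one.
# Helper names are illustrative; this is not the MVGD training pipeline.
import torch.nn as nn


def make_transformer(depth: int, dim: int = 256, heads: int = 8) -> nn.TransformerEncoder:
    layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
    return nn.TransformerEncoder(layer, num_layers=depth)


def grow_from(small: nn.TransformerEncoder, large: nn.TransformerEncoder) -> nn.TransformerEncoder:
    """Copy trained layers from the small model into the first layers of the large one;
    the remaining layers keep their fresh initialization and are learned during fine-tuning."""
    for src, dst in zip(small.layers, large.layers):
        dst.load_state_dict(src.state_dict())
    return large


small = make_transformer(depth=4)                    # assume this model was trained first
large = grow_from(small, make_transformer(depth=8))
# ... continue diffusion training on `large`, starting from the transferred weights.
```

The abstract does not specify how weights are transferred or scheduled; the point of the sketch is only that larger models start from smaller trained ones rather than from scratch.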