AerialMegaDepth: Learning Aerial-Ground Reconstruction and View Synthesis
April 17, 2025
Authors: Khiem Vuong, Anurag Ghosh, Deva Ramanan, Srinivasa Narasimhan, Shubham Tulsiani
cs.AI
Abstract
We explore the task of geometric reconstruction of images captured from a
mixture of ground and aerial views. Current state-of-the-art learning-based
approaches fail to handle the extreme viewpoint variation between aerial-ground
image pairs. Our hypothesis is that the lack of high-quality, co-registered
aerial-ground datasets for training is a key reason for this failure. Such data
is difficult to assemble precisely because it is difficult to reconstruct in a
scalable way. To overcome this challenge, we propose a scalable framework
combining pseudo-synthetic renderings from 3D city-wide meshes (e.g., Google
Earth) with real, ground-level crowd-sourced images (e.g., MegaDepth). The
pseudo-synthetic data simulates a wide range of aerial viewpoints, while the
real, crowd-sourced images help improve visual fidelity for ground-level images
where mesh-based renderings lack sufficient detail, effectively bridging the
domain gap between real images and pseudo-synthetic renderings. Using this
hybrid dataset, we fine-tune several state-of-the-art algorithms and achieve
significant improvements on real-world, zero-shot aerial-ground tasks. For
example, we observe that baseline DUSt3R localizes fewer than 5% of
aerial-ground pairs within 5 degrees of camera rotation error, while
fine-tuning with our data raises accuracy to nearly 56%, addressing a major
failure point in handling large viewpoint changes. Beyond camera estimation and
scene reconstruction, our dataset also improves performance on downstream tasks
like novel-view synthesis in challenging aerial-ground scenarios, demonstrating
the practical value of our approach in real-world applications.
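
For context, the "within 5 degrees of camera rotation error" figure quoted above is a standard relative-rotation accuracy metric. Below is a minimal sketch of how such an accuracy-at-5° number is typically computed from estimated and ground-truth camera rotations; the function names are illustrative assumptions, not taken from the paper's released code.

```python
import numpy as np

def rotation_error_deg(R_est: np.ndarray, R_gt: np.ndarray) -> float:
    """Geodesic distance (in degrees) between two 3x3 rotation matrices."""
    # Relative rotation taking the estimate onto the ground truth.
    R_rel = R_est.T @ R_gt
    # Angle from the trace identity: cos(theta) = (tr(R_rel) - 1) / 2.
    # Clamp to [-1, 1] to guard against floating-point drift.
    cos_theta = np.clip((np.trace(R_rel) - 1.0) / 2.0, -1.0, 1.0)
    return float(np.degrees(np.arccos(cos_theta)))

def accuracy_at_threshold(errors_deg, threshold_deg: float = 5.0) -> float:
    """Fraction of image pairs whose rotation error is below the threshold."""
    errors = np.asarray(errors_deg)
    return float(np.mean(errors < threshold_deg))
```

Under this metric, the reported results correspond to `accuracy_at_threshold` returning under 0.05 for baseline DUSt3R on aerial-ground pairs, versus roughly 0.56 after fine-tuning on the hybrid dataset.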