Sat3DGen: Comprehensive Street-Level 3D Scene Generation from a Single Satellite Image

Abstract

Generating a street-level 3D scene from a single satellite image is a crucial yet challenging task. Current methods present a stark trade-off: geometry-colorization models achieve high geometric fidelity but are typically building-focused and lack semantic diversity. In contrast, proxy-based models use feed-forward image-to-3D frameworks to generate holistic scenes by jointly learning geometry and texture, a process that yields rich content but coarse and unstable geometry. We attribute these geometric failures to the extreme viewpoint gap and the sparse, inconsistent supervision inherent in satellite-to-street data. To address these fundamental challenges, we introduce Sat3DGen, which embodies a geometry-first methodology. This methodology enhances the feed-forward paradigm by integrating novel geometric constraints with a perspective-view training strategy, explicitly countering the primary sources of geometric error. This geometry-centric strategy yields a dramatic leap in both 3D accuracy and photorealism. For validation, we first construct a new benchmark by pairing the VIGOR-OOD test set with high-resolution DSM data. On this benchmark, our method reduces geometric RMSE from 6.76 m to 5.20 m. Crucially, this geometric leap also boosts photorealism: relative to the leading method, Sat2Density++, it reduces the Fréchet Inception Distance (FID) from approximately 40 to 19, despite using no extra tailored image-quality modules. We demonstrate the versatility of our high-quality 3D assets through diverse downstream applications, including semantic-map-to-3D synthesis, multi-camera video generation, large-scale meshing, and unsupervised single-image Digital Surface Model (DSM) estimation. The code has been released at https://github.com/qianmingduowan/Sat3DGen.
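To make the geometric metric concrete, the following is a minimal sketch of how an RMSE between a predicted height map and a reference DSM could be computed. The abstract does not describe the actual evaluation protocol (alignment, masking, height normalization), so the function name `height_rmse`, the row-of-floats input layout, and the rule of skipping NaN cells are all assumptions made for illustration, not the authors' implementation.

```python
import math


def height_rmse(pred_rows, dsm_rows):
    """RMSE in metres between two height maps on the same grid.

    Hypothetical helper: both inputs are lists of rows of floats, and
    cells where either map is NaN (no valid height) are skipped.
    """
    if len(pred_rows) != len(dsm_rows):
        raise ValueError("height maps must have the same number of rows")
    squared_error_sum = 0.0
    valid_cells = 0
    for pred_row, dsm_row in zip(pred_rows, dsm_rows):
        if len(pred_row) != len(dsm_row):
            raise ValueError("height maps must have the same row length")
        for pred, ref in zip(pred_row, dsm_row):
            if math.isnan(pred) or math.isnan(ref):
                continue  # assumed convention: NaN marks a missing height
            squared_error_sum += (pred - ref) ** 2
            valid_cells += 1
    if valid_cells == 0:
        raise ValueError("no valid cells to compare")
    return math.sqrt(squared_error_sum / valid_cells)


# Toy 2x2 grid with made-up heights (not data from the paper).
pred = [[10.0, 12.0], [3.0, 0.0]]
dsm = [[13.0, 8.0], [3.0, float("nan")]]
print(f"RMSE: {height_rmse(pred, dsm):.3f} m")
```

Under this reading, the reported improvement from 6.76 m to 5.20 m corresponds to a lower value of this quantity averaged over valid cells of the benchmark; how the benchmark aggregates across scenes is not specified in the abstract.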