Sat3DGen: Comprehensive Street-Level 3D Scene Generation from a Single Satellite Image

Abstract

Generating a street-level 3D scene from a single satellite image is a crucial yet challenging task. Current methods present a stark trade-off. Geometry-colorization models achieve high geometric fidelity but are typically building-focused and lack semantic diversity. In contrast, proxy-based models use feed-forward image-to-3D frameworks to generate holistic scenes by jointly learning geometry and texture, a process that yields rich content but coarse and unstable geometry. We attribute these geometric failures to the extreme viewpoint gap and the sparse, inconsistent supervision inherent in satellite-to-street data. To address these fundamental challenges, we introduce Sat3DGen, which embodies a geometry-first methodology. This methodology enhances the feed-forward paradigm by integrating novel geometric constraints with a perspective-view training strategy, explicitly countering the primary sources of geometric error. This geometry-centric strategy yields a dramatic leap in both 3D accuracy and photorealism. For validation, we first construct a new benchmark by pairing the VIGOR-OOD test set with high-resolution DSM data. On this benchmark, our method improves geometric RMSE from 6.76 m to 5.20 m. Crucially, this geometric leap also boosts photorealism, reducing the Fréchet Inception Distance (FID) from ~40 to 19 relative to the leading method, Sat2Density++, despite using no additional tailored image-quality modules. We demonstrate the versatility of our high-quality 3D assets through diverse downstream applications, including semantic-map-to-3D synthesis, multi-camera video generation, large-scale meshing, and unsupervised single-image Digital Surface Model (DSM) estimation. The code is available at https://github.com/qianmingduowan/Sat3DGen.
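
To make the geometric metric concrete, the following is a minimal sketch of how an RMSE between a predicted height map and a reference DSM could be computed, together with the relative reduction implied by the reported figures (6.76 m to 5.20 m, roughly 23%). The abstract does not specify the benchmark's evaluation protocol, so the function names `dsm_rmse` and `relative_reduction`, the grid-of-rows input format, and the `nodata` masking convention are illustrative assumptions rather than the authors' implementation.

```python
import math


def dsm_rmse(pred, ref, nodata=None):
    """Root-mean-square height error (in the grids' units, e.g. metres).

    `pred` and `ref` are equally sized grids given as lists of rows.
    Cells whose reference value equals `nodata` are skipped
    (an assumed convention for missing DSM samples).
    """
    if len(pred) != len(ref):
        raise ValueError("grids must have the same number of rows")
    sq_sum, count = 0.0, 0
    for pred_row, ref_row in zip(pred, ref):
        if len(pred_row) != len(ref_row):
            raise ValueError("rows must have the same length")
        for p, r in zip(pred_row, ref_row):
            if nodata is not None and r == nodata:
                continue  # no valid reference height for this cell
            sq_sum += (p - r) ** 2
            count += 1
    if count == 0:
        raise ValueError("no valid cells to compare")
    return math.sqrt(sq_sum / count)


def relative_reduction(before, after):
    """Fractional reduction of an error metric, e.g. 0.25 for a 25% drop."""
    return (before - after) / before


if __name__ == "__main__":
    # Toy 2x2 example with one missing reference cell.
    pred = [[10.0, 12.0], [8.0, 0.0]]
    ref = [[10.0, 8.0], [11.0, -9999.0]]
    print(f"RMSE: {dsm_rmse(pred, ref, nodata=-9999.0):.3f} m")
    # RMSE figures reported in the abstract.
    print(f"Relative reduction: {relative_reduction(6.76, 5.20):.1%}")
```

In practice the two grids would first need to be aligned to a common ground resolution and vertical datum; that step is outside the scope of this sketch.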