Sat3DGen: Comprehensive Street-Level 3D Scene Generation from a Single Satellite Image

Abstract

Generating a street-level 3D scene from a single satellite image is a crucial yet challenging task. Current methods present a stark trade-off. Geometry-colorization models achieve high geometric fidelity but are typically building-focused and lack semantic diversity. Proxy-based models, in contrast, use feed-forward image-to-3D frameworks to generate holistic scenes by jointly learning geometry and texture, which yields rich content but coarse and unstable geometry. We attribute these geometric failures to the extreme viewpoint gap and the sparse, inconsistent supervision inherent in satellite-to-street data. To address these fundamental challenges, we introduce Sat3DGen, which embodies a geometry-first methodology. It enhances the feed-forward paradigm by integrating novel geometric constraints with a perspective-view training strategy, explicitly countering the primary sources of geometric error. This geometry-centric strategy yields a dramatic leap in both 3D accuracy and photorealism. For validation, we first construct a new benchmark by pairing the VIGOR-OOD test set with high-resolution Digital Surface Model (DSM) data. On this benchmark, our method improves geometric RMSE from 6.76 m to 5.20 m. Crucially, this geometric improvement also boosts photorealism: relative to the leading method, Sat2Density++, the Fréchet Inception Distance (FID) drops from approximately 40 to 19, even though our method uses no extra tailored image-quality modules. We demonstrate the versatility of our high-quality 3D assets through diverse downstream applications, including semantic-map-to-3D synthesis, multi-camera video generation, large-scale meshing, and unsupervised single-image DSM estimation. The code is available at https://github.com/qianmingduowan/Sat3DGen.
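
To make the geometric metric concrete, the following is a minimal sketch of how an RMSE between a predicted height map and a reference DSM could be computed, together with the relative reduction implied by the reported numbers. The abstract does not specify the evaluation protocol (alignment, resolution, masking), so the function names `height_rmse` and `relative_reduction`, the nodata convention, and the toy grids are all illustrative assumptions rather than the authors' implementation.

```python
import math


def height_rmse(pred, ref, nodata=None):
    """RMSE in metres between two equally sized height grids (lists of rows).

    Cells whose reference value equals `nodata` are skipped. This masking
    convention is an assumption, not the paper's documented protocol.
    """
    if len(pred) != len(ref):
        raise ValueError("grids must have the same number of rows")
    sq_sum, count = 0.0, 0
    for pred_row, ref_row in zip(pred, ref):
        if len(pred_row) != len(ref_row):
            raise ValueError("rows must have the same length")
        for p, r in zip(pred_row, ref_row):
            if nodata is not None and r == nodata:
                continue  # no reference height available for this cell
            sq_sum += (p - r) ** 2
            count += 1
    if count == 0:
        raise ValueError("no valid reference cells")
    return math.sqrt(sq_sum / count)


def relative_reduction(before, after):
    """Fractional decrease of an error metric (lower is better)."""
    return (before - after) / before


# Toy grids (hypothetical values): per-cell errors are 3, 4, 0, 0 metres
# over the four valid cells; two cells carry the nodata marker.
pred = [[10.0, 12.0, 8.0], [5.0, 7.0, 0.0]]
ref = [[13.0, 8.0, 8.0], [5.0, -9999.0, -9999.0]]
toy_rmse = height_rmse(pred, ref, nodata=-9999.0)

# Relative RMSE reduction implied by the figures reported in the abstract.
rmse_gain = relative_reduction(6.76, 5.20)
```

Under these assumptions, the reported change from 6.76 m to 5.20 m corresponds to a relative RMSE reduction of roughly 23%.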