CrossViewDiff: A Cross-View Diffusion Model for Satellite-to-Street View Synthesis

Abstract

Satellite-to-street view synthesis aims to generate a realistic street-view image from its corresponding satellite-view image. Although stable diffusion models have exhibited remarkable performance in a variety of image generation applications, their reliance on similar-view inputs to control the generated structure or texture restricts their application to the challenging cross-view synthesis task. In this work, we propose CrossViewDiff, a cross-view diffusion model for satellite-to-street view synthesis. To address the challenges posed by the large discrepancy across views, we design satellite scene structure estimation and cross-view texture mapping modules to construct the structural and textural controls for street-view image synthesis. We further design a cross-view control guided denoising process that incorporates these controls via an enhanced cross-view attention module. To evaluate the synthesis results more comprehensively, we additionally design a GPT-based scoring method as a supplement to standard evaluation metrics. We also explore the effect of different data sources (e.g., text, maps, building heights, and multi-temporal satellite imagery) on this task. Results on three public cross-view datasets show that CrossViewDiff outperforms current state-of-the-art methods on both standard and GPT-based evaluation metrics, generating high-quality street-view panoramas with more realistic structures and textures across rural, suburban, and urban scenes. The code and models of this work will be released at https://opendatalab.github.io/CrossViewDiff/.
