CrossViewDiff: A Cross-View Diffusion Model for Satellite-to-Street View Synthesis

Abstract

Satellite-to-street view synthesis aims to generate a realistic street-view image from its corresponding satellite-view image. Although stable diffusion models have exhibited remarkable performance in a variety of image generation applications, their reliance on similar-view inputs to control the generated structure or texture restricts their application to the challenging cross-view synthesis task. In this work, we propose CrossViewDiff, a cross-view diffusion model for satellite-to-street view synthesis. To address the challenges posed by the large discrepancy across views, we design satellite scene structure estimation and cross-view texture mapping modules to construct the structural and textural controls for street-view image synthesis. We further design a cross-view control guided denoising process that incorporates these controls via an enhanced cross-view attention module. To achieve a more comprehensive evaluation of the synthesis results, we additionally design a GPT-based scoring method as a supplement to standard evaluation metrics. We also explore the effect of different data sources (e.g., text, maps, building heights, and multi-temporal satellite imagery) on this task. Results on three public cross-view datasets show that CrossViewDiff outperforms the current state of the art on both standard and GPT-based evaluation metrics, generating high-quality street-view panoramas with more realistic structures and textures across rural, suburban, and urban scenes. The code and models of this work will be released at https://opendatalab.github.io/CrossViewDiff/.
