Around the World in 80 Timesteps: A Generative Approach to Global Visual Geolocation

Abstract

Global visual geolocation predicts where on Earth an image was captured. Because images vary in how precisely they can be localized, the task is inherently ambiguous. Existing approaches, however, are deterministic and overlook this aspect. In this paper, we aim to close the gap between traditional geolocation and modern generative methods. We propose the first generative geolocation approach based on diffusion and Riemannian flow matching, in which the denoising process operates directly on the Earth's surface. Our model achieves state-of-the-art performance on three visual geolocation benchmarks: OpenStreetView-5M, YFCC-100M, and iNat21. In addition, we introduce the task of probabilistic visual geolocation, in which the model predicts a probability distribution over all possible locations instead of a single point. We introduce new metrics and baselines for this task and demonstrate the advantages of our diffusion-based approach. Code and models will be made available.
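To make the idea of a denoising process that "operates directly on the Earth's surface" concrete, the following is a minimal sketch of flow matching on the unit sphere, not the authors' implementation. It assumes a time convention in which t = 0 is a noise point and t = 1 is the true location, a geodesic interpolation path, and plain Euler integration with the exponential map. All function names (`log_map`, `exp_map`, `conditional_velocity`, `euler_sample`) are hypothetical, and the image-conditioned network is replaced by an oracle that already knows the target.

```python
import numpy as np

def latlon_to_xyz(lat_deg, lon_deg):
    """Convert latitude/longitude in degrees to a unit vector in R^3."""
    lat, lon = np.radians(lat_deg), np.radians(lon_deg)
    return np.array([np.cos(lat) * np.cos(lon),
                     np.cos(lat) * np.sin(lon),
                     np.sin(lat)])

def xyz_to_latlon(p):
    """Convert a (near-)unit vector back to latitude/longitude in degrees."""
    p = p / np.linalg.norm(p)
    return np.degrees(np.arcsin(p[2])), np.degrees(np.arctan2(p[1], p[0]))

def log_map(x, y):
    """Tangent vector at x pointing toward y; its length is the geodesic distance."""
    cos_theta = np.clip(np.dot(x, y), -1.0, 1.0)
    u = y - cos_theta * x              # component of y orthogonal to x
    norm = np.linalg.norm(u)
    if norm < 1e-12:                   # x and y coincide (antipodes are not handled)
        return np.zeros(3)
    return np.arccos(cos_theta) * u / norm

def exp_map(x, v):
    """Walk from x along the great circle with initial tangent velocity v."""
    n = np.linalg.norm(v)
    if n < 1e-12:
        return x
    return np.cos(n) * x + np.sin(n) * v / n

def geodesic_point(x0, x1, t):
    """Point at fraction t along the geodesic from x0 (noise) to x1 (location)."""
    return exp_map(x0, t * log_map(x0, x1))

def conditional_velocity(xt, x1, t):
    """Regression target for flow matching on a geodesic path (valid for t < 1)."""
    return log_map(xt, x1) / (1.0 - t)

def euler_sample(x0, velocity_fn, n_steps=80):
    """Integrate a velocity field from t=0 to t=1, staying on the sphere."""
    x = x0
    for k in range(n_steps):
        t = k / n_steps
        x = exp_map(x, velocity_fn(x, t) / n_steps)
        x = x / np.linalg.norm(x)      # guard against numerical drift
    return x

if __name__ == "__main__":
    noise = latlon_to_xyz(-10.0, 120.0)       # arbitrary starting point
    target = latlon_to_xyz(48.8566, 2.3522)   # Paris, used only as an example
    # Oracle field: a trained model would predict this from the image instead.
    end = euler_sample(noise, lambda x, t: conditional_velocity(x, target, t))
    print(xyz_to_latlon(end))
```

In a trained system, the oracle passed to `euler_sample` would be replaced by a network conditioned on image features, and drawing several noise points would yield several location samples, which is what makes a distribution over locations possible.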
