CIPER: A Unified Framework for Cross-View Image Retrieval and Pose Estimation

Abstract

Cross-view geo-localization estimates the geographic location of a ground image by matching it against a database of aerial images. Existing methods address this task through either large-scale retrieval or precise pose estimation, but not both: retrieval-based methods enable wide-area search at the cost of localization accuracy, while pose estimation methods achieve high precision only within a narrow search space. Naively cascading the two pipelines introduces error propagation and inconsistent feature representations. We formulate cross-view geo-localization as a unified problem that requires city-scale retrieval and precise 3-DoF pose estimation at the same time. We propose CIPER (Cross-view Image-retrieval and Pose-estimation transformER), a single architecture that performs both tasks jointly through mutually beneficial feature learning. CIPER uses a shared transformer encoder with task-specific tokens to disentangle global retrieval features from spatial localization cues. To bridge the large domain gap between ground and aerial views, we introduce a two-way transformer pose decoder that uses ground features as spatial queries for bidirectional cross-attention. A set prediction strategy further enables stable 3-DoF regression under a unified multi-task objective. Experiments on VIGOR, KITTI, and Ford Multi-AV demonstrate competitive performance, especially under limited field-of-view and arbitrary orientation conditions. Code is available at https://github.com/yurimjeon1892/CIPER.
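To make the idea of bidirectional cross-attention with ground features as queries more concrete, the following is a minimal NumPy sketch and not the CIPER implementation. The abstract does not specify the decoder's internals, so everything here is an assumption made for illustration: the function names (`cross_attention`, `two_way_block`), the token counts, the feature dimension, the order of the two attention steps, and the omission of learned projections, multiple heads, normalization, and feed-forward layers.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention(queries, keys_values):
    """Single-head scaled dot-product cross-attention.

    Assumption: no learned projections; keys and values are the same tokens.
    Returns the attended features and the attention weights.
    """
    d = queries.shape[-1]
    scores = queries @ keys_values.T / np.sqrt(d)  # (num_queries, num_keys)
    weights = softmax(scores, axis=-1)             # each row sums to 1
    return weights @ keys_values, weights


def two_way_block(ground, aerial):
    """Hypothetical two-way block: attention runs in both directions."""
    # Direction 1: ground tokens act as queries over the aerial tokens.
    g_update, w_g2a = cross_attention(ground, aerial)
    ground = ground + g_update                     # residual update
    # Direction 2: aerial tokens query the updated ground tokens.
    a_update, w_a2g = cross_attention(aerial, ground)
    aerial = aerial + a_update
    return ground, aerial, w_g2a, w_a2g


# Illustrative sizes only: 4 ground tokens, 16 aerial tokens, 8-dim features.
rng = np.random.default_rng(0)
ground = rng.standard_normal((4, 8))
aerial = rng.standard_normal((16, 8))

g_out, a_out, w_g2a, w_a2g = two_way_block(ground, aerial)
print(g_out.shape, a_out.shape)  # (4, 8) (16, 8)
```

In a full model, a regression head on the updated ground tokens would presumably produce the 3-DoF pose; that head, along with the retrieval branch and the set prediction strategy, is outside the scope of this sketch.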