CIPER: A Unified Framework for Cross-View Image Retrieval and Pose Estimation

Abstract

Cross-view geo-localization estimates the geographic location of a ground image by matching it against an aerial image database. Existing methods tackle this through either large-scale retrieval or precise pose estimation, but not both: retrieval-based methods enable wide-area search at the cost of localization accuracy, while pose estimation methods achieve high precision within only a narrow search space. Naively cascading these pipelines introduces error propagation and inconsistent feature representations. We formulate cross-view geo-localization as a unified problem requiring simultaneous city-scale retrieval and precise 3-DoF pose estimation. We propose CIPER (Cross-view Image-retrieval and Pose-estimation transformER), a single architecture that jointly performs both tasks through mutually beneficial feature learning. CIPER uses a shared transformer encoder with task-specific tokens to disentangle global retrieval features from spatial localization cues. To bridge the large domain gap between ground and aerial views, we introduce a two-way transformer pose decoder that uses ground features as spatial queries for bidirectional cross-attention. A set prediction strategy further enables stable 3-DoF regression under a unified multi-task objective. Experiments on VIGOR, KITTI, and Ford Multi-AV demonstrate competitive performance, especially under limited field-of-view and arbitrary orientation conditions. Code is available at https://github.com/yurimjeon1892/CIPER.
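To make the two outputs of the unified problem concrete, the following is a minimal sketch, not the CIPER implementation. It assumes that retrieval ranks aerial descriptors by cosine similarity and that a 3-DoF pose consists of a planar position in meters plus a yaw angle in degrees. The names `Pose3DoF`, `retrieve`, `wrap_angle_deg`, and `pose_error` are hypothetical and chosen for illustration only; the abstract does not specify the similarity measure, units, or evaluation protocol.

```python
import math
from typing import NamedTuple, Sequence


class Pose3DoF(NamedTuple):
    """Planar pose. Units (meters, degrees) are an assumption of this sketch."""
    x: float
    y: float
    yaw_deg: float


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two non-zero descriptor vectors."""
    dot = sum(p * q for p, q in zip(a, b))
    norm_a = math.sqrt(sum(p * p for p in a))
    norm_b = math.sqrt(sum(q * q for q in b))
    return dot / (norm_a * norm_b)


def retrieve(query: Sequence[float], database: Sequence[Sequence[float]]) -> int:
    """Return the index of the aerial descriptor most similar to the ground query."""
    scores = [cosine_similarity(query, d) for d in database]
    return max(range(len(scores)), key=scores.__getitem__)


def wrap_angle_deg(angle: float) -> float:
    """Wrap an angle in degrees to the interval [-180, 180)."""
    return (angle + 180.0) % 360.0 - 180.0


def pose_error(pred: Pose3DoF, gt: Pose3DoF) -> tuple:
    """Return (translation error, absolute yaw error) between two poses."""
    translation = math.hypot(pred.x - gt.x, pred.y - gt.y)
    rotation = abs(wrap_angle_deg(pred.yaw_deg - gt.yaw_deg))
    return translation, rotation


if __name__ == "__main__":
    # Toy 2-D descriptors; real global descriptors would be high-dimensional.
    aerial_db = [[0.0, 1.0], [0.9, 0.1], [-1.0, 0.0]]
    best = retrieve([1.0, 0.0], aerial_db)
    print("retrieved aerial index:", best)
    print("pose error:", pose_error(Pose3DoF(3.0, 4.0, 170.0), Pose3DoF(0.0, 0.0, -170.0)))
```

Wrapping the yaw difference matters under arbitrary orientation: headings of 170 and -170 degrees differ by 20 degrees, not 340. In a cascaded pipeline the pose error is only meaningful when the retrieved aerial image is the correct one, which is the error-propagation issue the abstract raises.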