CIPER: A Unified Framework for Cross-View Image Retrieval and Pose Estimation

Abstract

Cross-view geo-localization estimates the geographic location of a ground image by matching it against a database of aerial images. Existing methods address this through either large-scale retrieval or precise pose estimation, but not both: retrieval-based methods enable wide-area search at the cost of localization accuracy, while pose estimation methods achieve high precision only within a limited search space. Naively cascading the two pipelines introduces error propagation and inconsistent feature representations. We formulate cross-view geo-localization as a unified problem that requires simultaneous city-scale retrieval and precise 3-DoF pose estimation. We propose CIPER (Cross-view Image-retrieval and Pose-estimation transformER), a single architecture that performs both tasks jointly through mutually beneficial feature learning. CIPER uses a shared transformer encoder with task-specific tokens to disentangle global retrieval features from spatial localization cues. To bridge the large domain gap between ground and aerial views, we introduce a two-way transformer pose decoder that uses ground features as spatial queries for bidirectional cross-attention. A set prediction strategy further enables stable 3-DoF regression under a unified multi-task objective. Experiments on VIGOR, KITTI, and Ford Multi-AV demonstrate competitive performance, especially under limited field-of-view and arbitrary orientation conditions. Code is available at https://github.com/yurimjeon1892/CIPER.
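To make the problem formulation concrete, the sketch below illustrates what "city-scale retrieval plus 3-DoF pose" produces as an output. It is not the CIPER model: the names `AerialTile`, `Pose3DoF`, `retrieve`, and `to_world` are hypothetical, the embeddings are hand-written toy vectors, cosine similarity is assumed as the retrieval metric, and the 3-DoF pose is assumed to be a pixel position inside the retrieved aerial tile plus a yaw angle.

```python
import math
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass
class AerialTile:
    """Hypothetical database entry: one geo-referenced aerial image."""
    tile_id: str
    embedding: Sequence[float]  # global retrieval descriptor (toy values here)
    center_east_m: float        # world position of the tile center, in meters
    center_north_m: float
    meters_per_pixel: float
    size_px: int                # tile is assumed square: size_px x size_px


@dataclass
class Pose3DoF:
    """Assumed 3-DoF pose: pixel location inside a tile plus heading."""
    u_px: float     # column, increasing to the east
    v_px: float     # row, increasing to the south (image convention)
    yaw_rad: float


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def retrieve(query: Sequence[float], tiles: Sequence[AerialTile]) -> AerialTile:
    """Retrieval step: pick the tile whose descriptor best matches the query."""
    return max(tiles, key=lambda t: cosine_similarity(query, t.embedding))


def to_world(tile: AerialTile, pose: Pose3DoF) -> Tuple[float, float, float]:
    """Pose step: convert an in-tile pose into world coordinates (east, north, yaw)."""
    half = tile.size_px / 2.0
    east = tile.center_east_m + (pose.u_px - half) * tile.meters_per_pixel
    # Image rows grow downward, so a larger v means further south.
    north = tile.center_north_m - (pose.v_px - half) * tile.meters_per_pixel
    return east, north, pose.yaw_rad


# Toy usage: two tiles, one ground query, one in-tile pose estimate.
tiles = [
    AerialTile("A", [1.0, 0.0], 0.0, 0.0, 0.5, 512),
    AerialTile("B", [0.0, 1.0], 256.0, 0.0, 0.5, 512),
]
best = retrieve([0.1, 0.9], tiles)
world_pose = to_world(best, Pose3DoF(u_px=300.0, v_px=200.0, yaw_rad=math.pi / 2))
```

In a naive cascade, `retrieve` and the pose estimate would come from separate models with separate features, so a wrong tile from the first step cannot be corrected by the second; the abstract's argument is that a single architecture learns both outputs jointly instead.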