
Thinking with Map: Reinforced Parallel Map-Augmented Agent for Geolocalization

January 8, 2026
Authors: Yuxiang Ji, Yong Wang, Ziyu Ma, Yiming Hu, Hailang Huang, Xuecai Hu, Guanhua Chen, Liaoni Wu, Xiangxiang Chu
cs.AI

Abstract

The image geolocalization task aims to predict where on Earth an image was taken, using visual clues. Existing large vision-language model (LVLM) approaches leverage world knowledge, chain-of-thought reasoning, and agentic capabilities, but overlook a strategy humans commonly use: consulting maps. In this work, we first equip the model with a Thinking with Map ability and formulate it as an agent-in-the-map loop. We develop a two-stage optimization scheme for it, consisting of agentic reinforcement learning (RL) followed by parallel test-time scaling (TTS). The RL stage strengthens the agentic capability of the model to improve sampling efficiency, and the parallel TTS stage enables the model to explore multiple candidate paths before making the final prediction, which is crucial for geolocalization. To evaluate our method on up-to-date, in-the-wild images, we further present MAPBench, a comprehensive geolocalization training and evaluation benchmark composed entirely of real-world images. Experimental results show that our method outperforms existing open- and closed-source models on most metrics, specifically improving Acc@500m from 8.0% to 22.1% compared to Gemini-3-Pro with Google Search/Map grounded mode.
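The Acc@500m figure quoted above can be read as the fraction of predictions that land within 500 m of the true location. The abstract does not define how the distance is measured, so the sketch below assumes great-circle (haversine) distance on a spherical Earth; the function names `haversine_m` and `accuracy_at` are hypothetical and are not taken from the paper.

```python
import math

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius in metres (spherical approximation)


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two (lat, lon) points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    # Clamp to 1.0 to guard against floating-point overshoot near antipodal points.
    return 2 * EARTH_RADIUS_M * math.asin(min(1.0, math.sqrt(a)))


def accuracy_at(preds, truths, threshold_m=500.0):
    """Fraction of predicted (lat, lon) pairs within threshold_m of the ground truth.

    Assumed reading of Acc@d; the paper's exact protocol may differ.
    """
    if len(preds) != len(truths):
        raise ValueError("preds and truths must have the same length")
    if not preds:
        return 0.0
    hits = sum(haversine_m(*p, *t) <= threshold_m for p, t in zip(preds, truths))
    return hits / len(preds)


if __name__ == "__main__":
    truths = [(48.8584, 2.2945), (48.8584, 2.2945)]
    preds = [(48.8594, 2.2945), (48.8684, 2.2945)]  # roughly 111 m and 1.1 km north
    print(accuracy_at(preds, truths, threshold_m=500.0))
```

Under this reading, a threshold of 500 m rewards only street-level predictions, which is why it is a much stricter measure than city- or country-level accuracy.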