
GeoVista: Web-Augmented Agentic Visual Reasoning for Geolocalization

November 19, 2025
Authors: Yikun Wang, Zuyan Liu, Ziyi Wang, Pengfei Liu, Han Hu, Yongming Rao
cs.AI

Abstract

Current research on agentic visual reasoning enables deep multimodal understanding but focuses primarily on image manipulation tools, leaving a gap on the way to more general-purpose agentic models. In this work, we revisit the geolocalization task, which requires not only nuanced visual grounding but also web search to confirm or refine hypotheses during reasoning. Existing geolocalization benchmarks offer neither the high-resolution imagery nor the localization difficulty needed to exercise deep agentic reasoning, so we curate GeoBench, a benchmark that includes photos and panoramas from around the world, along with a subset of satellite images of different cities, to rigorously evaluate the geolocalization ability of agentic models. We also propose GeoVista, an agentic model that integrates tool invocation directly into the reasoning loop, with an image-zoom-in tool to magnify regions of interest and a web-search tool to retrieve related web information. We develop a complete training pipeline for it: a cold-start supervised fine-tuning (SFT) stage to learn reasoning patterns and tool-use priors, followed by a reinforcement learning (RL) stage to further enhance reasoning ability. We adopt a hierarchical reward that leverages multi-level geographical information to improve overall geolocalization performance. Experimental results show that GeoVista substantially outperforms other open-source agentic models on the geolocalization task and achieves performance comparable to closed-source models such as Gemini-2.5-flash and GPT-5 on most metrics.
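
To make the idea of a hierarchical reward over multi-level geographical information concrete, here is a minimal Python sketch. It is not the authors' reward: the abstract does not specify the levels, the weights, or the matching rule, so the three levels (country, region, city), the weight values, and the names `Location`, `LEVEL_REWARDS`, and `hierarchical_reward` are all assumptions chosen for illustration. The sketch grants credit level by level from coarse to fine and stops at the first mismatch, so a prediction with the right country but the wrong region still earns partial reward.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Location:
    """Hypothetical multi-level location label (coarse to fine)."""
    country: str
    region: str
    city: str


# Assumed per-level weights, coarse to fine; the abstract gives no values.
LEVEL_REWARDS = (("country", 0.2), ("region", 0.3), ("city", 0.5))


def _norm(name: str) -> str:
    """Normalize a place name for a simple case-insensitive comparison."""
    return name.strip().lower()


def hierarchical_reward(pred: Location, truth: Location) -> float:
    """Accumulate reward level by level, stopping at the first mismatch."""
    total = 0.0
    for level, weight in LEVEL_REWARDS:
        if _norm(getattr(pred, level)) != _norm(getattr(truth, level)):
            break  # a finer level cannot count once a coarser one is wrong
        total += weight
    return total
```

Under these assumed weights, a fully correct prediction scores the sum of all three weights, while one that matches only the country scores the country weight alone. A real RL reward would likely also need to handle place-name aliases and distance-based credit, which this sketch leaves out.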