GeoVista: Web-Augmented Agentic Visual Reasoning for Geolocalization

November 19, 2025
Authors: Yikun Wang, Zuyan Liu, Ziyi Wang, Pengfei Liu, Han Hu, Yongming Rao
cs.AI

Abstract

Current research on agentic visual reasoning enables deep multimodal understanding but focuses primarily on image manipulation tools, leaving a gap toward more general-purpose agentic models. In this work, we revisit the geolocalization task, which requires not only nuanced visual grounding but also web search to confirm or refine hypotheses during reasoning. Because existing geolocalization benchmarks neither meet the need for high-resolution imagery nor pose a localization challenge that demands deep agentic reasoning, we curate GeoBench, a benchmark that includes photos and panoramas from around the world, along with a subset of satellite images of different cities, to rigorously evaluate the geolocalization ability of agentic models. We also propose GeoVista, an agentic model that integrates tool invocation within the reasoning loop, including an image-zoom-in tool to magnify regions of interest and a web-search tool to retrieve related web information. We develop a complete training pipeline for it: a cold-start supervised fine-tuning (SFT) stage to learn reasoning patterns and tool-use priors, followed by a reinforcement learning (RL) stage to further enhance reasoning ability. We adopt a hierarchical reward to leverage multi-level geographical information and improve overall geolocalization performance. Experimental results show that GeoVista substantially outperforms other open-source agentic models on the geolocalization task and achieves performance comparable to closed-source models such as Gemini-2.5-flash and GPT-5 on most metrics.
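
The abstract does not specify how the hierarchical reward is computed, so the following is only an illustrative sketch of what a reward over multi-level geographical information could look like, not the authors' formulation. It assumes three hypothetical levels (country, region, city), exact string matching after normalization, and made-up per-level weights; the names `Location`, `LEVEL_REWARDS`, and `hierarchical_reward` are all invented for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Location:
    """A prediction or ground-truth label at three assumed granularities."""
    country: str
    region: str
    city: str


# Hypothetical weights, ordered coarse to fine. These values are assumptions
# for illustration and do not come from the paper.
LEVEL_REWARDS = (("country", 0.25), ("region", 0.25), ("city", 0.5))


def _normalize(name: str) -> str:
    """Compare names case-insensitively and ignore surrounding whitespace."""
    return name.strip().casefold()


def hierarchical_reward(pred: Location, truth: Location) -> float:
    """Sum the weights of consecutive matching levels, coarse to fine.

    A finer level only counts if every coarser level also matched, so a
    correct city name under the wrong country earns nothing.
    """
    total = 0.0
    for level, weight in LEVEL_REWARDS:
        if _normalize(getattr(pred, level)) != _normalize(getattr(truth, level)):
            break
        total += weight
    return total


if __name__ == "__main__":
    truth = Location("France", "Ile-de-France", "Paris")
    guess = Location("france", "Ile-de-France", "Versailles")
    print(hierarchical_reward(guess, truth))  # country and region match
```

Under these assumptions, a prediction that is right at a coarse level still receives partial credit, which is one plausible way a graded signal could be fed to an RL stage instead of a single all-or-nothing city match.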
December 1, 2025