NAVIG: Natural Language-guided Analysis with Vision Language Models for Image Geo-localization

Abstract

Image geo-localization is the task of predicting the specific location of an image and requires complex reasoning across visual, geographical, and cultural contexts. While prior Vision Language Models (VLMs) have the best accuracy at this task, there is a dearth of high-quality datasets and models for analytical reasoning. We first create NaviClues, a high-quality dataset derived from GeoGuessr, a popular geography game, to supply examples of expert reasoning expressed in language. Using this dataset, we present Navig, a comprehensive image geo-localization framework integrating global and fine-grained image information. By reasoning with language, Navig reduces the average distance error by 14% compared to previous state-of-the-art models while requiring fewer than 1,000 training samples. Our dataset and code are available at https://github.com/SparrowZheyuan18/Navig/.