NAVIG: Natural Language-guided Analysis with Vision Language Models for Image Geo-localization
February 20, 2025
Authors: Zheyuan Zhang, Runze Li, Tasnim Kabir, Jordan Boyd-Graber
cs.AI
Abstract
Image geo-localization is the task of predicting the specific location of an
image and requires complex reasoning across visual, geographical, and cultural
contexts. While prior Vision Language Models (VLMs) have the best accuracy at
this task, there is a dearth of high-quality datasets and models for analytical
reasoning. We first create NaviClues, a high-quality dataset derived from
GeoGuessr, a popular geography game, to supply examples of expert reasoning
from language. Using this dataset, we present Navig, a comprehensive image
geo-localization framework integrating global and fine-grained image
information. By reasoning with language, Navig reduces the average distance
error by 14% compared to previous state-of-the-art models while requiring fewer
than 1000 training samples. Our dataset and code are available at
https://github.com/SparrowZheyuan18/Navig/.
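The headline result is stated as a reduction in average distance error. As a minimal sketch of how such a metric could be computed, the Python below averages great-circle (haversine) distances between predicted and true coordinates and then takes the relative reduction against a baseline. The abstract does not specify the exact distance formula, so the haversine choice, the Earth radius constant, and all function names here are assumptions made for illustration rather than the authors' evaluation code.

```python
import math

# Mean Earth radius in kilometres (assumed constant for this sketch).
EARTH_RADIUS_KM = 6371.0088


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    # Clamp to 1.0 to guard against floating-point overshoot near antipodes.
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(min(1.0, a)))


def mean_distance_error_km(predictions, targets):
    """Average haversine error over paired lists of (lat, lon) tuples."""
    if not predictions or len(predictions) != len(targets):
        raise ValueError("predictions and targets must be non-empty and equal in length")
    total = sum(haversine_km(p[0], p[1], t[0], t[1]) for p, t in zip(predictions, targets))
    return total / len(predictions)


def relative_reduction(baseline_error, new_error):
    """Fractional reduction of new_error relative to baseline_error."""
    if baseline_error <= 0:
        raise ValueError("baseline_error must be positive")
    return (baseline_error - new_error) / baseline_error
```

Under this definition, a model whose mean error falls from 100 km to 86 km against the same test images would show a relative reduction of 0.14, which is how a figure such as 14% would be read.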