NAVIG: Natural Language-guided Analysis with Vision Language Models for Image Geo-localization

Abstract

Image geo-localization is the task of predicting the specific location of an image and requires complex reasoning across visual, geographical, and cultural contexts. While prior Vision Language Models (VLMs) have the best accuracy at this task, there is a dearth of high-quality datasets and models for analytical reasoning. We first create NaviClues, a high-quality dataset derived from GeoGuessr, a popular geography game, to supply examples of expert reasoning from language. Using this dataset, we present Navig, a comprehensive image geo-localization framework integrating global and fine-grained image information. By reasoning with language, Navig reduces the average distance error by 14% compared to previous state-of-the-art models while requiring fewer than 1,000 training samples. Our dataset and code are available at https://github.com/SparrowZheyuan18/Navig/.
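
The headline metric above, average distance error, can be illustrated with a minimal sketch. The abstract does not specify the evaluation protocol, so everything below is an assumption: distances are taken as great-circle (haversine) distances on a spherical Earth with a mean radius of 6371 km, and the function names `haversine_km`, `mean_distance_error_km`, and `relative_reduction` are hypothetical helpers, not part of the Navig codebase.

```python
import math

# Assumption of this sketch: spherical Earth with mean radius 6371 km.
EARTH_RADIUS_KM = 6371.0


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    # Clamp to 1.0 to guard against floating-point overshoot before asin.
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(min(1.0, a)))


def mean_distance_error_km(predictions, targets):
    """Average haversine distance between predicted and true (lat, lon) pairs."""
    if not predictions or len(predictions) != len(targets):
        raise ValueError("predictions and targets must be non-empty and of equal length")
    total = sum(haversine_km(*p, *t) for p, t in zip(predictions, targets))
    return total / len(predictions)


def relative_reduction(baseline_error, new_error):
    """Fractional reduction of new_error relative to baseline_error."""
    return (baseline_error - new_error) / baseline_error
```

Under these assumptions, a 14% reduction corresponds to `relative_reduction(baseline, new)` being about 0.14; for example, hypothetical mean errors of 100 km for a baseline and 86 km for a new model would give that value.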