Geolocation with Real Human Gameplay Data: A Large-Scale Dataset and Human-Like Reasoning Framework

Abstract

Geolocation, the task of identifying an image's location, requires complex reasoning and is crucial for navigation, monitoring, and cultural preservation. However, current methods often produce coarse, imprecise, and hard-to-interpret location estimates. A major obstacle is the quality and scale of existing geolocation datasets, which are typically small and automatically constructed. The result is noisy data and inconsistent task difficulty: images either reveal the answer too easily or lack sufficient clues for reliable inference. To address these problems, we introduce a comprehensive geolocation framework with three components: GeoComp, a large-scale dataset; GeoCoT, a novel reasoning method; and GeoEval, an evaluation metric. At the core of the framework is GeoComp (Geolocation Competition Dataset), collected over two years from a geolocation game platform with 740K users. It comprises 25 million metadata entries and 3 million geo-tagged locations spanning much of the globe, with each location annotated thousands to tens of thousands of times by human users. The dataset offers diverse difficulty levels for detailed analysis and highlights key gaps in current models. Building on this dataset, we propose Geographical Chain-of-Thought (GeoCoT), a multi-step reasoning framework designed to enhance the reasoning capabilities of Large Vision Models (LVMs) on geolocation tasks. GeoCoT improves performance by integrating contextual and spatial cues through a multi-step process that mimics human geolocation reasoning. Finally, using the GeoEval metric, we demonstrate that GeoCoT improves geolocation accuracy by up to 25% while enhancing interpretability.