GRE Suite: Geo-localization Inference via Fine-Tuned Vision-Language Models and Enhanced Reasoning Chains

Abstract

Recent advances in Visual Language Models (VLMs) have demonstrated exceptional performance on visual reasoning tasks. Geo-localization, however, presents unique challenges: it requires extracting multigranular visual cues from images and systematically integrating them with external world knowledge for reasoning. Current geo-localization approaches often lack robust reasoning mechanisms and explainability, which limits their effectiveness. To address these limitations, we propose the Geo Reason Enhancement (GRE) Suite, a novel framework that augments VLMs with structured reasoning chains for accurate and interpretable location inference. The GRE Suite is developed systematically along three key dimensions: dataset, model, and benchmark. First, we introduce GRE30K, a high-quality geo-localization reasoning dataset designed to facilitate fine-grained visual and contextual analysis. Next, we present the GRE model, which employs a multi-stage reasoning strategy to progressively infer scene attributes, local details, and semantic features, thereby narrowing down the candidate geographic regions with greater precision. Finally, we construct the Geo Reason Evaluation Benchmark (GREval-Bench), a comprehensive evaluation framework that assesses VLMs across diverse urban, natural, and landmark scenes and measures both coarse-grained (e.g., country, continent) and fine-grained (e.g., city, street) localization performance. Experimental results demonstrate that GRE significantly outperforms existing methods at all granularities of geo-localization, underscoring the efficacy of reasoning-augmented VLMs for complex geographic inference. Code and data will be released at https://github.com/Thorin215/GRE.
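To make the coarse-grained versus fine-grained distinction concrete, the sketch below scores predicted coordinates by the fraction that fall within a distance threshold of the ground truth. It is a minimal illustration only: the abstract does not specify the GREval-Bench protocol, so the threshold values (borrowed from a convention common in the geo-localization literature) and the names `THRESHOLDS_KM`, `haversine_km`, and `accuracy_at_thresholds` are all assumptions.

```python
import math

# Assumed distance thresholds in km, one per granularity level.
# These follow a common convention in geo-localization work and are
# NOT confirmed as the thresholds used by GREval-Bench.
THRESHOLDS_KM = {
    "street": 1,
    "city": 25,
    "region": 200,
    "country": 750,
    "continent": 2500,
}

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def accuracy_at_thresholds(preds, truths, thresholds=THRESHOLDS_KM):
    """Fraction of predictions within each threshold of the ground truth.

    preds and truths are equal-length lists of (lat, lon) tuples in degrees.
    """
    if len(preds) != len(truths) or not preds:
        raise ValueError("preds and truths must be non-empty and equal length")
    dists = [haversine_km(p[0], p[1], t[0], t[1])
             for p, t in zip(preds, truths)]
    return {name: sum(d <= km for d in dists) / len(dists)
            for name, km in thresholds.items()}
```

Under these assumptions, a model that gets the country right but the city wrong would score at the 750 km and 2500 km levels while missing the 1 km and 25 km levels, which is the kind of gap between coarse and fine localization that a multi-granularity benchmark is meant to expose.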