GRE Suite: Geo-localization Inference via Fine-Tuned Vision-Language Models and Enhanced Reasoning Chains

Abstract

Recent advances in vision-language models (VLMs) have demonstrated exceptional performance on visual reasoning tasks. Geo-localization, however, poses distinct challenges: it requires extracting multi-granular visual cues from an image and integrating them with external world knowledge for systematic reasoning. Current approaches to geo-localization often lack robust reasoning mechanisms and explainability, which limits their effectiveness. To address these limitations, we propose the Geo Reason Enhancement (GRE) Suite, a novel framework that augments VLMs with structured reasoning chains for accurate and interpretable location inference. The GRE Suite is developed systematically across three key dimensions: dataset, model, and benchmark. First, we introduce GRE30K, a high-quality geo-localization reasoning dataset designed to facilitate fine-grained visual and contextual analysis. Next, we present the GRE model, which employs a multi-stage reasoning strategy to progressively infer scene attributes, local details, and semantic features, thereby narrowing down candidate geographic regions with greater precision. Finally, we construct the Geo Reason Evaluation Benchmark (GREval-Bench), a comprehensive evaluation framework that assesses VLMs across diverse urban, natural, and landmark scenes and measures both coarse-grained (e.g., country, continent) and fine-grained (e.g., city, street) localization performance. Experimental results demonstrate that GRE significantly outperforms existing methods at all granularities of geo-localization, underscoring the efficacy of reasoning-augmented VLMs for complex geographic inference. Code and data will be released at https://github.com/Thorin215/GRE.
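
To make the coarse-grained versus fine-grained distinction concrete, the following is a minimal sketch of one common way to score geo-localization at several granularities: the fraction of predictions whose great-circle distance to the ground truth falls within a threshold. It is not taken from GREval-Bench, whose exact scoring protocol is not described in this abstract. The function names and the threshold values (1, 25, 200, 750, and 2500 km, a convention used in prior geo-localization work) are assumptions made for illustration only.

```python
import math

# Assumed distance thresholds in kilometres, one per granularity level.
# These follow a convention from prior geo-localization work and are
# NOT confirmed as the GREval-Bench settings.
THRESHOLDS_KM = {
    "street": 1,
    "city": 25,
    "region": 200,
    "country": 750,
    "continent": 2500,
}


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    earth_radius_km = 6371.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * earth_radius_km * math.asin(math.sqrt(a))


def accuracy_by_granularity(preds, truths, thresholds=THRESHOLDS_KM):
    """Fraction of predictions within each distance threshold.

    preds and truths are equal-length lists of (lat, lon) tuples in degrees.
    """
    if len(preds) != len(truths) or not preds:
        raise ValueError("preds and truths must be non-empty and equal length")
    dists = [haversine_km(*p, *t) for p, t in zip(preds, truths)]
    return {
        name: sum(d <= km for d in dists) / len(dists)
        for name, km in thresholds.items()
    }


if __name__ == "__main__":
    # Hypothetical predictions: one exact hit, one roughly 95 km off.
    preds = [(48.8584, 2.2945), (35.0, 139.0)]
    truths = [(48.8584, 2.2945), (35.6762, 139.6503)]
    print(accuracy_by_granularity(preds, truths))
```

Under this kind of metric, a model can score well at the country or continent level while still failing at the city or street level, which is why results are usually reported per granularity.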