GRE Suite: Geo-localization Inference via Fine-Tuned Vision-Language Models and Enhanced Reasoning Chains

Abstract

Recent advances in Visual Language Models (VLMs) have demonstrated exceptional performance in visual reasoning tasks. Geo-localization, however, poses a distinct challenge: it requires extracting visual cues at multiple granularities from an image and integrating them with external world knowledge for systematic reasoning. Current approaches to geo-localization often lack robust reasoning mechanisms and explainability, which limits their effectiveness. To address these limitations, we propose the Geo Reason Enhancement (GRE) Suite, a novel framework that augments VLMs with structured reasoning chains for accurate and interpretable location inference. The GRE Suite is systematically developed across three key dimensions: dataset, model, and benchmark. First, we introduce GRE30K, a high-quality geo-localization reasoning dataset designed to facilitate fine-grained visual and contextual analysis. Next, we present the GRE model, which employs a multi-stage reasoning strategy to progressively infer scene attributes, local details, and semantic features, thereby narrowing down potential geographic regions with enhanced precision. Finally, we construct the Geo Reason Evaluation Benchmark (GREval-Bench), a comprehensive evaluation framework that assesses VLMs across diverse urban, natural, and landmark scenes to measure both coarse-grained (e.g., country, continent) and fine-grained (e.g., city, street) localization performance. Experimental results demonstrate that GRE significantly outperforms existing methods across all granularities of geo-localization tasks, underscoring the efficacy of reasoning-augmented VLMs in complex geographic inference. Code and data will be released at https://github.com/Thorin215/GRE.
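The abstract does not say how each granularity level is scored. As a minimal sketch only, the following shows one convention common in earlier geo-localization work: a prediction counts as correct at a given level if its great-circle distance to the ground truth falls within a fixed radius. The threshold values, the level names in `THRESHOLDS_KM`, and the function names are assumptions made for illustration and are not taken from GREval-Bench.

```python
import math

EARTH_RADIUS_KM = 6371.0

# ASSUMPTION: illustrative radii per granularity level, following a convention
# used in prior geo-localization benchmarks; not confirmed by the GRE paper.
THRESHOLDS_KM = {
    "street": 1,
    "city": 25,
    "region": 200,
    "country": 750,
    "continent": 2500,
}


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    # Clamp to guard against floating-point values marginally above 1.
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))


def accuracy_at_thresholds(preds, truths):
    """Fraction of predictions within each radius in THRESHOLDS_KM.

    preds and truths are equal-length sequences of (lat, lon) pairs in degrees.
    """
    if len(preds) != len(truths) or not preds:
        raise ValueError("preds and truths must be non-empty and of equal length")
    dists = [haversine_km(p[0], p[1], t[0], t[1]) for p, t in zip(preds, truths)]
    return {
        level: sum(d <= km for d in dists) / len(dists)
        for level, km in THRESHOLDS_KM.items()
    }


if __name__ == "__main__":
    paris = (48.8566, 2.3522)
    london = (51.5074, -0.1278)
    # One exact prediction and one that is off by a Paris-to-London distance.
    print(accuracy_at_thresholds([paris, london], [paris, paris]))
```

Under this scheme a single prediction can be wrong at street level yet correct at country or continent level, which is why coarse-grained and fine-grained performance are reported separately.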
