TerraScope: Pixel-Grounded Visual Reasoning for Earth Observation

Abstract

Vision-language models (VLMs) have shown promise in earth observation (EO), yet they struggle with tasks that require grounding complex spatial reasoning in precise pixel-level visual representations. To address this problem, we introduce TerraScope, a unified VLM for pixel-grounded geospatial reasoning with two key capabilities: (1) modality-flexible reasoning: it handles single-modality inputs (optical or SAR) and, when both modalities are available, adaptively fuses them into the reasoning process; (2) multi-temporal reasoning: it integrates temporal sequences for change analysis across multiple time points. In addition, we curate Terra-CoT, a large-scale dataset containing 1 million samples with pixel-level masks embedded in reasoning chains across multiple sources. We also propose TerraScope-Bench, the first benchmark for pixel-grounded geospatial reasoning; its six sub-tasks evaluate both answer accuracy and mask quality to ensure authentic pixel-grounded reasoning. Experiments show that TerraScope significantly outperforms existing VLMs on pixel-grounded geospatial reasoning while providing interpretable visual evidence.
