

TerraScope: Pixel-Grounded Visual Reasoning for Earth Observation

March 19, 2026
Authors: Yan Shu, Bin Ren, Zhitong Xiong, Xiao Xiang Zhu, Begüm Demir, Nicu Sebe, Paolo Rota
cs.AI

Abstract

Vision-language models (VLMs) have shown promise in Earth observation (EO), yet they struggle with tasks that require grounding complex spatial reasoning in precise pixel-level visual representations. To address this problem, we introduce TerraScope, a unified VLM that delivers pixel-grounded geospatial reasoning with two key capabilities: (1) modality-flexible reasoning: it handles single-modality inputs (optical or SAR) and adaptively fuses the modalities into the reasoning process when both are available; (2) multi-temporal reasoning: it integrates temporal sequences for change analysis across multiple time points. In addition, we curate Terra-CoT, a large-scale dataset of 1 million samples drawn from multiple sources, with pixel-level masks embedded in the reasoning chains. We also propose TerraScope-Bench, the first benchmark for pixel-grounded geospatial reasoning, whose six sub-tasks evaluate both answer accuracy and mask quality to ensure authentic pixel-grounded reasoning. Experiments show that TerraScope significantly outperforms existing VLMs on pixel-grounded geospatial reasoning while providing interpretable visual evidence.
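The abstract states that TerraScope-Bench scores both answer accuracy and mask quality, but does not specify the exact metric. A minimal sketch of one plausible joint protocol, assuming exact-match answers and an IoU threshold on binary masks (the function names, threshold, and sample format are illustrative assumptions, not the paper's definitions):

```python
import numpy as np

def mask_iou(pred, gt):
    """Intersection-over-union between two binary masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    # Two empty masks count as a perfect match.
    return inter / union if union else 1.0

def joint_score(samples, iou_thresh=0.5):
    """Fraction of samples where the answer is correct AND the
    predicted mask overlaps the ground truth above the threshold."""
    hits = sum(
        1
        for pred_ans, gt_ans, pred_mask, gt_mask in samples
        if pred_ans == gt_ans and mask_iou(pred_mask, gt_mask) >= iou_thresh
    )
    return hits / len(samples)
```

Coupling the two checks in one score is what makes such a benchmark resist "right answer, wrong evidence" shortcuts: a model cannot earn credit for a correct textual answer grounded in an inaccurate mask.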