TerraScope: Pixel-Grounded Visual Reasoning for Earth Observation
March 19, 2026
Authors: Yan Shu, Bin Ren, Zhitong Xiong, Xiao Xiang Zhu, Begüm Demir, Nicu Sebe, Paolo Rota
cs.AI
Abstract
Vision-language models (VLMs) have shown promise in Earth observation (EO), yet they struggle with tasks that require grounding complex spatial reasoning in precise pixel-level visual representations. To address this problem, we introduce TerraScope, a unified VLM that delivers pixel-grounded geospatial reasoning with two key capabilities: (1) modality-flexible reasoning: it handles single-modality inputs (optical or SAR) and adaptively fuses both modalities into the reasoning process when they are available; (2) multi-temporal reasoning: it integrates temporal sequences for change analysis across multiple time points. In addition, we curate Terra-CoT, a large-scale dataset of 1 million samples drawn from multiple sources, with pixel-level masks embedded directly in the reasoning chains. We also propose TerraScope-Bench, the first benchmark for pixel-grounded geospatial reasoning, comprising six sub-tasks that evaluate both answer accuracy and mask quality to ensure authentic pixel-grounded reasoning. Experiments show that TerraScope significantly outperforms existing VLMs on pixel-grounded geospatial reasoning while providing interpretable visual evidence.
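To make the dual-metric evaluation concrete, the sketch below shows one way a benchmark could jointly score a textual answer and a predicted segmentation mask. This is an illustrative assumption, not the paper's actual protocol: the function names (`mask_iou`, `score_sample`), the exact-match answer check, and the set-of-pixels mask encoding are all hypothetical simplifications.

```python
# Illustrative sketch (assumed, not TerraScope-Bench's real protocol):
# score a sample on both answer accuracy and mask quality (IoU).

def mask_iou(pred_pixels, gt_pixels):
    """IoU between two binary masks, each given as an iterable of (row, col) pixels."""
    pred, gt = set(pred_pixels), set(gt_pixels)
    union = len(pred | gt)
    # Two empty masks are treated as a perfect match.
    return len(pred & gt) / union if union else 1.0

def score_sample(pred_answer, gt_answer, pred_mask, gt_mask):
    """Return (answer_accuracy, mask_iou) for one benchmark sample."""
    acc = float(pred_answer == gt_answer)  # exact match as a stand-in metric
    return acc, mask_iou(pred_mask, gt_mask)

acc, iou = score_sample(
    "flooded area expanded", "flooded area expanded",
    pred_mask=[(0, 0), (0, 1), (1, 1)],
    gt_mask=[(0, 1), (1, 1), (1, 0)],
)
# acc = 1.0; iou = 2/4 = 0.5
```

Reporting both numbers per sample, rather than answer accuracy alone, is what lets a benchmark penalize models that produce correct answers without correct pixel-level evidence.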