

EarthMind: Towards Multi-Granular and Multi-Sensor Earth Observation with Large Multimodal Models

June 2, 2025
Authors: Yan Shu, Bin Ren, Zhitong Xiong, Danda Pani Paudel, Luc Van Gool, Begum Demir, Nicu Sebe, Paolo Rota
cs.AI

Abstract

Large Multimodal Models (LMMs) have demonstrated strong performance in various vision-language tasks. However, they often struggle to comprehensively understand Earth Observation (EO) data, which is critical for monitoring the environment and the effects of human activity on it. In this work, we present EarthMind, a novel vision-language framework for multi-granular and multi-sensor EO data understanding. EarthMind features two core components: (1) Spatial Attention Prompting (SAP), which reallocates attention within the LLM to enhance pixel-level understanding; and (2) Cross-modal Fusion, which aligns heterogeneous modalities into a shared space and adaptively reweighs tokens based on their information density for effective fusion. To facilitate multi-sensor fusion evaluation, we propose EarthMind-Bench, a comprehensive benchmark with over 2,000 human-annotated multi-sensor image-question pairs, covering a wide range of perception and reasoning tasks. Extensive experiments demonstrate the effectiveness of EarthMind. It achieves state-of-the-art performance on EarthMind-Bench, surpassing GPT-4o despite being only 4B in scale. Moreover, EarthMind outperforms existing methods on multiple public EO benchmarks, showcasing its potential to handle both multi-granular and multi-sensor challenges in a unified framework.
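
The abstract's Cross-modal Fusion component aligns tokens from heterogeneous sensors into a shared space and adaptively reweights them by information density before fusion. The snippet below is a minimal sketch of that idea, assuming PyTorch, a channel-entropy proxy for "information density", and an already-shared embedding space; the function names and weighting scheme are illustrative assumptions, not the paper's actual implementation.

```python
# Illustrative sketch only: density-based token reweighting for multi-sensor fusion.
# The density proxy (channel entropy) and the softmax weighting are assumptions,
# not EarthMind's published method.
import torch
import torch.nn.functional as F

def information_density(tokens: torch.Tensor) -> torch.Tensor:
    """Per-token density proxy: entropy of each token's distribution over
    feature channels. Input (N, D) -> output (N,)."""
    probs = F.softmax(tokens, dim=-1)
    return -(probs * probs.clamp_min(1e-9).log()).sum(dim=-1)

def fuse_sensor_tokens(optical: torch.Tensor, sar: torch.Tensor) -> torch.Tensor:
    """Concatenate tokens from two sensors (assumed already projected into a
    shared space) and scale each token by its normalized information density."""
    tokens = torch.cat([optical, sar], dim=0)             # (N_opt + N_sar, D)
    density = information_density(tokens)                  # (N,)
    weights = F.softmax(density, dim=0) * tokens.size(0)   # mean weight ~= 1
    return tokens * weights.unsqueeze(-1)                  # reweighted tokens

# Usage: 256 optical tokens and 256 SAR tokens in a 1024-dim shared space.
optical = torch.randn(256, 1024)
sar = torch.randn(256, 1024)
fused = fuse_sensor_tokens(optical, sar)
print(fused.shape)  # torch.Size([512, 1024])
```

The design intuition is that tokens carrying more distinctive information (e.g., SAR tokens over cloud-covered regions) receive larger weights, so neither sensor dominates the fused sequence by token count alone.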
