Zero-Shot Multi-Spectral Learning: Reimagining a Generalist Multimodal Gemini 2.5 Model for Remote Sensing Applications
September 23, 2025
Authors: Ganesh Mallya, Yotam Gigi, Dahun Kim, Maxim Neumann, Genady Beryozkin, Tomer Shekel, Anelia Angelova
cs.AI
Abstract
Multi-spectral imagery plays a crucial role in diverse remote sensing
applications, including land-use classification, environmental monitoring, and
urban planning. These images are widely adopted because their additional
spectral bands correlate strongly with physical materials on the ground, such
as ice, water, and vegetation, which allows for more accurate identification.
Their public availability from missions such as Sentinel-2 and Landsat
only adds to their value. Currently, the automatic analysis of such data is
predominantly handled by machine learning models trained specifically for
multi-spectral input, which are costly to train and support. Furthermore,
although these additional inputs are highly useful for remote sensing, they
cannot be used with powerful generalist large multimodal models, which are
capable of solving many visual problems but are not able to understand
specialized multi-spectral signals.
To address this, we propose a training-free approach that introduces new
multi-spectral data, in a strictly zero-shot mode, as input to generalist
multimodal models trained on RGB-only inputs. Our approach leverages the
multimodal models' understanding of the visual space: it adapts the
multi-spectral inputs to that space and injects domain-specific information
into the model as instructions. We exemplify this idea with the Gemini 2.5
model and observe strong zero-shot performance gains on popular remote sensing
benchmarks for land cover and land use classification, demonstrating the easy
adaptability of Gemini 2.5 to new inputs. These results highlight the potential
for geospatial professionals working with non-standard specialized inputs to
easily leverage powerful multimodal models, such as Gemini 2.5, to accelerate
their work, benefiting from the models' rich reasoning and contextual
capabilities grounded in the specialized sensor data.
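
The abstract does not spell out how inputs are adapted to the visual space or what the instructions contain, so the following is only an illustrative sketch under our own assumptions, not the authors' pipeline. It assumes a common false-color convention (Sentinel-2 near-infrared band B08, red band B04, and green band B03 placed in the R, G, and B channels) and a hypothetical set of class labels. The helper names `stretch_to_uint8`, `pseudo_rgb`, and `build_instruction` are invented for this example, and a real workflow would likely use an array library rather than nested lists.

```python
from typing import Dict, List, Sequence, Tuple

Band = List[List[float]]          # one spectral band: rows of reflectance values
Pixel = Tuple[int, int, int]


def stretch_to_uint8(band: Band) -> List[List[int]]:
    """Linearly rescale a band to the 0..255 range of an image channel."""
    values = [v for row in band for v in row]
    lo, hi = min(values), max(values)
    span = (hi - lo) or 1.0  # constant band: avoid dividing by zero
    return [[round(255 * (v - lo) / span) for v in row] for row in band]


def pseudo_rgb(bands: Dict[str, Band], order: Sequence[str]) -> List[List[Pixel]]:
    """Place three chosen bands into the R, G and B channels, in that order."""
    r, g, b = (stretch_to_uint8(bands[name]) for name in order)
    return [list(zip(r_row, g_row, b_row)) for r_row, g_row, b_row in zip(r, g, b)]


def build_instruction(order: Sequence[str], classes: Sequence[str]) -> str:
    """Describe the channel mapping in text so the model can interpret colors."""
    mapping = ", ".join(f"{ch} = band {name}" for ch, name in zip("RGB", order))
    return (
        "The image is a false-color composite of multi-spectral data, not a "
        f"natural photograph. Channel mapping: {mapping}. "
        "Healthy vegetation reflects strongly in the near-infrared. "
        f"Classify the scene as one of: {', '.join(classes)}."
    )


# Toy 1x2 scene; band values and class labels are made up for illustration.
bands = {"B08": [[0.0, 1.0]], "B04": [[0.2, 0.0]], "B03": [[1.0, 0.0]]}
order = ("B08", "B04", "B03")  # assumed NIR, red, green false-color ordering
image = pseudo_rgb(bands, order)
prompt = build_instruction(order, ["Forest", "River", "Residential"])
```

In this sketch, `image` would be encoded as an ordinary RGB picture and sent to the multimodal model together with `prompt`. No model weights are changed, which is the sense in which such an approach is training-free.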