Zero-Shot Multi-Spectral Learning: Reimagining a Generalist Multimodal Gemini 2.5 Model for Remote Sensing Applications

September 23, 2025
Authors: Ganesh Mallya, Yotam Gigi, Dahun Kim, Maxim Neumann, Genady Beryozkin, Tomer Shekel, Anelia Angelova
cs.AI

Abstract

Multi-spectral imagery plays a crucial role in diverse remote sensing applications, including land-use classification, environmental monitoring, and urban planning. These images are widely adopted because their additional spectral bands correlate strongly with physical materials on the ground, such as ice, water, and vegetation, which allows for more accurate identification. Their public availability from missions such as Sentinel-2 and Landsat adds to their value. Currently, automatic analysis of such data is predominantly handled by machine learning models trained specifically on multi-spectral input, which are costly to train and support. Furthermore, although these additional inputs are highly useful for remote sensing, they cannot be used with powerful generalist large multimodal models, which can solve many visual problems but cannot interpret specialized multi-spectral signals. To address this, we propose a training-free approach that introduces new multi-spectral data, in a purely zero-shot mode, as input to generalist multimodal models trained on RGB-only inputs. Our approach leverages the multimodal models' understanding of the visual space: it adapts the inputs to that space and injects domain-specific information into the model as instructions. We exemplify this idea with the Gemini 2.5 model, observe strong zero-shot performance gains on popular remote sensing benchmarks for land cover and land use classification, and demonstrate how easily Gemini 2.5 adapts to new inputs. These results highlight the potential for geospatial professionals working with non-standard specialized inputs to easily leverage powerful multimodal models such as Gemini 2.5 to accelerate their work, benefiting from the models' rich reasoning and contextual capabilities grounded in the specialized sensor data.
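
The abstract does not say which band combinations or prompt wording the authors use, so the following is only a minimal sketch of the general idea, under stated assumptions: extra spectral bands are rendered as RGB-shaped images (here a near-infrared false-color composite and an NDVI map, both chosen for illustration), and a text instruction tells the model how to read them. All function names and the prompt text are hypothetical, and no model call is shown.

```python
import numpy as np


def normalize(band, lo=2.0, hi=98.0):
    """Percentile-stretch a single band to the range [0, 1]."""
    band = band.astype(np.float64)
    p_lo, p_hi = np.percentile(band, [lo, hi])
    if p_hi <= p_lo:  # constant band: avoid division by zero
        return np.zeros_like(band)
    return np.clip((band - p_lo) / (p_hi - p_lo), 0.0, 1.0)


def false_color(nir, red, green):
    """Assumed composite: NIR -> R channel, red -> G, green -> B (8-bit)."""
    stacked = np.stack(
        [normalize(nir), normalize(red), normalize(green)], axis=-1
    )
    return (stacked * 255).round().astype(np.uint8)


def ndvi(nir, red, eps=1e-6):
    """Normalized difference vegetation index, clipped to [-1, 1]."""
    nir = nir.astype(np.float64)
    red = red.astype(np.float64)
    return np.clip((nir - red) / (nir + red + eps), -1.0, 1.0)


def ndvi_image(nir, red):
    """Render NDVI as an 8-bit grayscale image with 3 identical channels."""
    gray = ((ndvi(nir, red) + 1.0) / 2.0 * 255).round().astype(np.uint8)
    return np.repeat(gray[..., None], 3, axis=-1)


def build_instruction():
    """Hypothetical prompt describing how the rendered images were built."""
    return (
        "You are given two images of the same location. "
        "Image 1 is a false-color composite: the red channel shows "
        "near-infrared reflectance, the green channel shows red, and the "
        "blue channel shows green, so healthy vegetation appears bright red. "
        "Image 2 shows NDVI in grayscale: brighter pixels mean denser "
        "vegetation, dark pixels usually mean water or bare surfaces. "
        "Classify the land cover of the scene."
    )
```

The resulting arrays can be saved as ordinary images and passed, together with the instruction, to any multimodal model that accepts RGB input; which renderings and descriptions work best is an empirical question that this sketch does not address.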