Zero-Shot Multi-Spectral Learning: Reimagining a Generalist Multimodal Gemini 2.5 Model for Remote Sensing Applications

Abstract

Multi-spectral imagery plays a crucial role in diverse remote sensing applications, including land-use classification, environmental monitoring, and urban planning. These images are widely adopted because their additional spectral bands correlate strongly with physical materials on the ground, such as ice, water, and vegetation. This allows for more accurate identification, and their public availability from missions such as Sentinel-2 and Landsat only adds to their value. Currently, the automatic analysis of such data is predominantly handled by machine learning models trained specifically for multi-spectral input, which are costly to train and support. Furthermore, despite their utility for remote sensing, these additional inputs cannot be used with powerful generalist large multimodal models, which can solve many visual problems but cannot interpret specialized multi-spectral signals. To address this, we propose a training-free approach that introduces new multi-spectral data, in a zero-shot-only mode, as inputs to generalist multimodal models trained on RGB-only inputs. Our approach leverages the multimodal models' understanding of the visual space: it adapts the inputs to that space and injects domain-specific information into the model as instructions. We exemplify this idea with the Gemini 2.5 model, observe strong zero-shot performance gains on popular remote sensing benchmarks for land cover and land use classification, and demonstrate the easy adaptability of Gemini 2.5 to new inputs. These results highlight the potential for geospatial professionals working with non-standard specialized inputs to easily leverage powerful multimodal models such as Gemini 2.5 to accelerate their work, benefiting from rich reasoning and contextual capabilities grounded in the specialized sensor data.
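The abstract does not spell out how inputs are adapted to the visual space or how the instructions are phrased, so the following is only an illustrative sketch of one plausible reading, not the authors' implementation. It assumes the non-RGB bands are rendered as ordinary three-channel images (a standard NIR/red/green false-color composite and a grayscale NDVI map) and that a text instruction tells the model what each image encodes. The function names, the percentile stretch, and the prompt wording are all hypothetical.

```python
import numpy as np


def normalize(band, lo=2.0, hi=98.0):
    """Percentile-stretch one band to uint8 in [0, 255] (assumed preprocessing)."""
    band = band.astype(np.float64)
    p_lo, p_hi = np.percentile(band, [lo, hi])
    if p_hi <= p_lo:
        # Constant band: nothing to stretch, return a black channel.
        return np.zeros(band.shape, dtype=np.uint8)
    scaled = np.clip((band - p_lo) / (p_hi - p_lo), 0.0, 1.0)
    return (scaled * 255).round().astype(np.uint8)


def false_color(nir, red, green):
    """Place NIR, red, green in the R, G, B channels (standard false-color composite)."""
    return np.stack([normalize(nir), normalize(red), normalize(green)], axis=-1)


def ndvi_image(nir, red):
    """Render NDVI = (NIR - red) / (NIR + red) as a grayscale three-channel image."""
    nir = nir.astype(np.float64)
    red = red.astype(np.float64)
    denom = nir + red
    # Where the denominator is zero, NDVI is left at 0.
    ndvi = np.divide(nir - red, denom, out=np.zeros_like(denom), where=denom != 0)
    ndvi = np.clip(ndvi, -1.0, 1.0)
    gray = ((ndvi + 1.0) / 2.0 * 255).round().astype(np.uint8)
    return np.stack([gray, gray, gray], axis=-1)


def build_prompt(class_names):
    """Hypothetical instruction text describing what each pseudo-image encodes."""
    classes = ", ".join(class_names)
    return (
        "You are given two images of the same location.\n"
        "Image 1 is a false-color composite: red channel = near-infrared, "
        "green channel = red band, blue channel = green band. "
        "Healthy vegetation appears bright red.\n"
        "Image 2 is an NDVI map: brighter pixels mean denser vegetation, "
        "darker pixels often indicate water or bare surfaces.\n"
        f"Classify the scene as exactly one of: {classes}."
    )


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Synthetic reflectance values stand in for real sensor bands.
    nir, red, green = (rng.uniform(0.0, 1.0, (64, 64)) for _ in range(3))
    print(false_color(nir, red, green).shape)
    print(ndvi_image(nir, red).shape)
    print(build_prompt(["Forest", "River", "Residential"]))
```

The resulting arrays could be encoded as PNG images and sent, together with the prompt, to any multimodal model that accepts RGB images; the model call itself is omitted here because its details are not given in the abstract.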
