UniverSat: A Resolution- and Modality-Agnostic Transformer for Earth Observation

Abstract

Vision Transformers (ViTs) dominate computer vision. However, their reliance on rigid patch projectors hinders transfer to Earth Observation (EO), where input modalities, scales, and resolutions vary widely. We introduce UniverSat, a ViT-style backbone built around a Universal Patch Encoder that maps patches of arbitrary spatial, spectral, and temporal resolution, from both optical and non-optical sensors, into a shared embedding space using a single shared set of weights. This enables training a single model on heterogeneous multimodal corpora via self-supervision, yielding robust, sensor-agnostic spatial features. We validate this approach on classification and segmentation tasks from the standard EO benchmarks GeoBench, PANGEABench, and SpectralEarth, obtaining strong results. Our code and models are available at https://github.com/gastruc/UniverSat.