UniverSat: A Resolution- and Modality-Agnostic Transformer for Earth Observation

Abstract

Vision Transformers (ViTs) dominate computer vision. However, their reliance on rigid patch projectors hinders transfer to Earth Observation (EO), where input modalities, scales, and resolutions vary widely. We introduce UniverSat, a ViT-style backbone built around a Universal Patch Encoder that maps patches of arbitrary spatial, spectral, and temporal resolution, from both optical and non-optical sensors, into a shared embedding space using a single set of shared weights. This makes it possible to train a single model on heterogeneous multimodal corpora with self-supervision, yielding robust, sensor-agnostic spatial features. We validate the approach on classification and segmentation tasks from the standard EO benchmarks GeoBench, PANGEABench, and SpectralEarth, where it achieves strong results. Our code and models are available at https://github.com/gastruc/UniverSat.
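
To make the rigidity of a standard patch projector concrete, the following minimal sketch computes the input width that a conventional ViT linear patch projector expects, namely patch_size × patch_size × channels. It is an illustration only and is not taken from the UniverSat code base; the sensor names and their configurations are hypothetical, chosen to show that a projector sized for one sensor cannot be applied directly to another.

```python
def fixed_projector_input_dim(patch_size: int, channels: int) -> int:
    """Input width of a standard ViT linear patch projector: P * P * C."""
    return patch_size * patch_size * channels


# Hypothetical sensor configurations, chosen only for illustration.
SENSORS = {
    "rgb_aerial": {"patch_size": 16, "channels": 3},
    "multispectral": {"patch_size": 16, "channels": 12},
    "sar": {"patch_size": 8, "channels": 2},
}


def compatible(trained_on: str, applied_to: str) -> bool:
    """True if a projector sized for one sensor accepts patches from another."""
    expected = fixed_projector_input_dim(**SENSORS[trained_on])
    received = fixed_projector_input_dim(**SENSORS[applied_to])
    return expected == received


# Flattened patch width per sensor; each value fixes the projector's weight shape.
dims = {name: fixed_projector_input_dim(**cfg) for name, cfg in SENSORS.items()}

for name, dim in dims.items():
    print(f"{name}: projector expects {dim} inputs per patch")
print("rgb_aerial -> multispectral:", compatible("rgb_aerial", "multispectral"))
```

Because each configuration yields a different flattened width, a fixed projector would need separate weights per sensor. This is the constraint that a patch encoder with shared weights is meant to remove.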