DynamicVis: An Efficient and General Visual Foundation Model for Remote Sensing Image Understanding

Abstract

The advancement of remote sensing technology has improved the spatial resolution of satellite imagery, enabling more detailed visual representations for diverse interpretation tasks. However, existing methods generalize poorly across varied applications. While some recent foundation models show potential, they are hindered by insufficient cross-task adaptability and primarily process low-resolution imagery of restricted sizes, so they fail to fully exploit high-resolution data or to leverage comprehensive large-scene semantics. Crucially, remote sensing imagery differs fundamentally from natural images: key foreground targets (e.g., maritime objects, artificial structures) often occupy a minimal share of the image area (~1%) and are sparsely distributed. Efficiently modeling cross-task generalizable knowledge from long 2D token sequences (~100,000 tokens) is critical for remote sensing image understanding, yet it remains a significant challenge. Motivated by the selective attention mechanisms inherent to the human visual system, we propose DynamicVis, a dynamic visual perception foundation model for remote sensing imagery. The framework integrates a novel dynamic region perception backbone based on the selective state space model, which strategically balances localized detail extraction with global contextual integration, enabling computationally efficient encoding of large-scale data while maintaining architectural scalability. To enhance cross-task knowledge transfer, we introduce a multi-instance learning paradigm that uses meta-embedding representations and is trained on million-scale region-level annotations. Evaluations across nine downstream tasks demonstrate the model's versatility. DynamicVis achieves multi-level feature modeling with exceptional efficiency, processing 2048 × 2048 pixel images with 97 ms latency (6% of ViT's) and 833 MB of GPU memory (3% of ViT's).
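To give a feel for the token counts mentioned above, the following is a minimal sketch, not the DynamicVis implementation. It counts the non-overlapping patch tokens in a 2048 × 2048 image and compares that with the number of pairwise interactions full self-attention would need. It then keeps only a small fraction of tokens by score, as a generic stand-in for focusing on sparse foreground regions. The patch sizes, the function names `num_tokens` and `top_k_indices`, and the scoring scheme are all illustrative assumptions; the abstract does not specify them.

```python
import math


def num_tokens(height: int, width: int, patch: int) -> int:
    """Count non-overlapping patch tokens (hypothetical helper).

    Assumes the image size is an exact multiple of the patch size.
    """
    if height % patch or width % patch:
        raise ValueError("image size must be divisible by the patch size")
    return (height // patch) * (width // patch)


def top_k_indices(scores, ratio):
    """Return the indices of the highest-scoring tokens, in ascending order.

    Keeps ceil(ratio * N) tokens (at least one). This is a generic top-k
    selection, used here only to illustrate sparse token selection.
    """
    k = max(1, math.ceil(ratio * len(scores)))
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(order[:k])


if __name__ == "__main__":
    # Patch sizes are assumptions chosen for illustration.
    for patch in (16, 8, 4):
        n = num_tokens(2048, 2048, patch)
        # Full self-attention compares every token with every other token.
        print(f"patch={patch:2d}  tokens={n:7d}  pairwise={n * n:.2e}")
```

Under these assumptions, the token count grows quadratically as the patch size shrinks, and the pairwise term grows with the square of the token count. This is why a backbone whose cost scales roughly linearly with sequence length, combined with attention to a small subset of tokens, is attractive at this resolution.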
