DynamicVis: An Efficient and General Visual Foundation Model for Remote Sensing Image Understanding

Abstract

Advances in remote sensing technology have improved the spatial resolution of satellite imagery, enabling more detailed visual representations for diverse interpretation tasks. However, existing methods generalize poorly across applications. Some recent foundation models show promise, but they are limited by insufficient cross-task adaptability and mainly process low-resolution images of restricted size, so they fail to fully exploit high-resolution data or the semantics of large scenes. Crucially, remote sensing imagery differs fundamentally from natural images: key foreground targets (e.g., maritime objects, artificial structures) often occupy a minimal share of the image area (~1%) and are sparsely distributed. Efficiently modeling cross-task generalizable knowledge from long 2D token sequences (~100,000 tokens) is a significant challenge, yet it is critical for remote sensing image understanding. Motivated by the selective attention mechanisms inherent to the human visual system, we propose DynamicVis, a dynamic visual perception foundation model for remote sensing imagery. The framework integrates a novel dynamic region perception backbone based on the selective state space model, which balances localized detail extraction with global contextual integration, enabling computationally efficient encoding of large-scale data while maintaining architectural scalability. To enhance cross-task knowledge transfer, we introduce a multi-instance learning paradigm that uses meta-embedding representations and is trained on million-scale region-level annotations. Evaluations across nine downstream tasks demonstrate the model's versatility. DynamicVis achieves multi-level feature modeling with exceptional efficiency, processing 2048×2048-pixel images with 97 ms latency (6% of ViT's) and 833 MB of GPU memory (3% of ViT's).
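To put the quoted figures in proportion, the following is a rough back-of-the-envelope sketch, not a result reported by the authors. It assumes the rounded values in the abstract (about 100,000 tokens, about 1% foreground, and latency and memory at 6% and 3% of ViT's) can be taken at face value; the implied ViT baselines are derived from those rounded percentages and are therefore only approximate. The variable and function names are illustrative.

```python
# Back-of-the-envelope arithmetic using only the figures quoted in the abstract.
# All inputs are rounded in the source, so every derived value is approximate.

NUM_TOKENS = 100_000          # "~100,000" 2D tokens per large image
FOREGROUND_FRACTION = 0.01    # foreground targets occupy "~1%" of the image

DYNAMICVIS_LATENCY_MS = 97.0  # reported latency for a 2048x2048 input
DYNAMICVIS_MEMORY_MB = 833.0  # reported GPU memory for a 2048x2048 input
LATENCY_RATIO_TO_VIT = 0.06   # "6% of ViT's"
MEMORY_RATIO_TO_VIT = 0.03    # "3% of ViT's"


def implied_baseline(value: float, ratio: float) -> float:
    """Baseline value implied by 'value is ratio times the baseline'."""
    return value / ratio


# Tokens that would cover foreground targets if tokens were spread uniformly.
foreground_tokens = round(NUM_TOKENS * FOREGROUND_FRACTION)

# Token pairs that dense self-attention would compare (quadratic in length).
attention_pairs = NUM_TOKENS ** 2

# ViT figures implied by the rounded ratios (assumption: ratios are exact).
implied_vit_latency_ms = implied_baseline(DYNAMICVIS_LATENCY_MS, LATENCY_RATIO_TO_VIT)
implied_vit_memory_mb = implied_baseline(DYNAMICVIS_MEMORY_MB, MEMORY_RATIO_TO_VIT)

print(f"foreground tokens:        ~{foreground_tokens}")
print(f"dense attention pairs:    ~{attention_pairs:.0e}")
print(f"implied ViT latency (ms): ~{implied_vit_latency_ms:.0f}")
print(f"implied ViT memory (MB):  ~{implied_vit_memory_mb:.0f}")
```

Under these assumptions only about a thousand of the roughly hundred thousand tokens would cover foreground targets, while dense self-attention would still compare on the order of 10^10 token pairs. This is the imbalance that motivates a selective, linear-time backbone.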
