

DynamicVis: An Efficient and General Visual Foundation Model for Remote Sensing Image Understanding

March 20, 2025
作者: Keyan Chen, Chenyang Liu, Bowen Chen, Wenyuan Li, Zhengxia Zou, Zhenwei Shi
cs.AI

Abstract
The advancement of remote sensing technology has improved the spatial resolution of satellite imagery, facilitating more detailed visual representations for diverse interpretation needs. However, existing methods exhibit limited generalization across varied applications. While some contemporary foundation models show promise, they are hindered by insufficient cross-task adaptability and primarily process low-resolution imagery of restricted size, thus failing to fully exploit high-resolution data or comprehensive large-scene semantics. Crucially, remote sensing imagery differs fundamentally from natural images: key foreground targets (e.g., maritime objects, artificial structures) often occupy a minimal spatial proportion (~1%) and exhibit sparse distributions. Efficiently modeling cross-task generalizable knowledge from lengthy 2D token sequences (~100,000 tokens) poses a significant challenge yet remains critical for remote sensing image understanding. Motivated by the selective attention mechanisms inherent to the human visual system, we propose DynamicVis, a dynamic visual perception foundation model for remote sensing imagery. The framework integrates a novel dynamic region perception backbone based on the selective state space model, which strategically balances localized detail extraction with global contextual integration, enabling computationally efficient encoding of large-scale data while maintaining architectural scalability. To enhance cross-task knowledge transfer, we introduce a multi-instance learning paradigm utilizing meta-embedding representations, trained on million-scale region-level annotations. Evaluations across nine downstream tasks demonstrate the model's versatility. DynamicVis achieves multi-level feature modeling with exceptional efficiency, processing 2048×2048-pixel inputs with 97 ms latency (6% of ViT's) and 833 MB of GPU memory (3% of ViT's).
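The sparse-foreground intuition above (salient targets covering roughly 1% of a very long token sequence) can be illustrated with a minimal token-selection sketch. This is not the paper's actual mechanism: the function name, the L2-norm saliency proxy, and the keep ratio below are illustrative assumptions only.

```python
import numpy as np

def dynamic_token_selection(tokens: np.ndarray, keep_ratio: float = 0.01):
    """Keep only the most salient fraction of patch tokens (illustrative sketch).

    tokens: (N, D) array of flattened image-patch embeddings.
    Returns the selected tokens and their indices in the original sequence.
    """
    # Saliency proxy: L2 norm of each token embedding. The paper's real
    # dynamic region perception backbone scores regions differently.
    scores = np.linalg.norm(tokens, axis=1)
    k = max(1, int(round(len(tokens) * keep_ratio)))
    idx = np.argsort(scores)[-k:]  # indices of the k highest-scoring tokens
    return tokens[idx], idx

# Example: a 2048x2048 image with 16x16 patches yields 128*128 = 16384 tokens.
rng = np.random.default_rng(0)
tokens = rng.standard_normal((128 * 128, 64))
selected, idx = dynamic_token_selection(tokens, keep_ratio=0.01)
print(selected.shape)  # (164, 64) -- about 1% of 16384 tokens
```

Only the selected tokens would then receive fine-grained processing, while the remainder contributes coarse global context, which is how a sub-linear cost in the number of salient regions becomes possible.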

