DynamicVis: An Efficient and General Visual Foundation Model for Remote Sensing Image Understanding
March 20, 2025
Authors: Keyan Chen, Chenyang Liu, Bowen Chen, Wenyuan Li, Zhengxia Zou, Zhenwei Shi
cs.AI
Abstract
The advancement of remote sensing technology has improved the spatial
resolution of satellite imagery, facilitating more detailed visual
representations for diverse interpretation tasks. However, existing methods exhibit
limited generalization capabilities across varied applications. While some
contemporary foundation models demonstrate potential, they are hindered by
insufficient cross-task adaptability and primarily process low-resolution
imagery of restricted sizes, thus failing to fully exploit high-resolution data
or leverage comprehensive large-scene semantics. Crucially, remote sensing
imagery differs fundamentally from natural images, as key foreground targets
(e.g., maritime objects, artificial structures) often occupy minimal spatial
proportions (~1%) and exhibit sparse distributions. Efficiently modeling
cross-task generalizable knowledge from lengthy 2D token sequences (~100,000 tokens) poses a
significant challenge yet remains critical for remote sensing image
understanding. Motivated by the selective attention mechanisms inherent to the
human visual system, we propose DynamicVis, a dynamic visual perception
foundation model for remote sensing imagery. The framework integrates a novel
dynamic region perception backbone based on the selective state space model,
which strategically balances localized detail extraction with global contextual
integration, enabling computationally efficient encoding of large-scale data
while maintaining architectural scalability. To enhance cross-task knowledge
transfer, we introduce a multi-instance learning paradigm utilizing
meta-embedding representations, trained on million-scale region-level
annotations. Evaluations across nine downstream tasks demonstrate the model's
versatility. DynamicVis achieves multi-level feature modeling with exceptional
efficiency, processing 2048x2048-pixel inputs with 97 ms latency (6% of ViT's) and
833 MB GPU memory (3% of ViT's).
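To make the dynamic region perception idea concrete, below is a minimal PyTorch sketch of the pattern the abstract describes: score all image tokens, route only a sparse top-k subset through a linear-complexity selective scan (a simplified gated recurrence standing in for the selective state space model), and give the remaining tokens only a cheap pooled global-context update. All names and hyperparameters here (DynamicRegionBlock, keep_ratio, the gating recurrence) are illustrative assumptions, not the paper's actual implementation.

```python
# Sketch of dynamic region perception: select sparse foreground tokens,
# refine them with a selective scan, and keep global context for the rest.
import torch
import torch.nn as nn


class DynamicRegionBlock(nn.Module):
    def __init__(self, dim: int, keep_ratio: float = 0.25):
        super().__init__()
        self.keep_ratio = keep_ratio
        self.score = nn.Linear(dim, 1)          # per-token saliency score
        # Selective-scan parameters: input-dependent gate and state update.
        self.gate = nn.Linear(dim, dim)
        self.update = nn.Linear(dim, dim)
        self.global_proj = nn.Linear(dim, dim)  # cheap global-context path

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, N, C) flattened image tokens; N can approach ~100k at 2048x2048.
        B, N, C = x.shape
        k = max(1, int(N * self.keep_ratio))

        # 1) Select a sparse subset of tokens by learned saliency.
        scores = self.score(x).squeeze(-1)            # (B, N)
        topk = scores.topk(k, dim=1).indices          # (B, k)
        idx = topk.unsqueeze(-1).expand(-1, -1, C)
        selected = x.gather(1, idx)                   # (B, k, C)

        # 2) Linear-complexity selective scan over the selected tokens:
        #    h_t = g_t * h_{t-1} + (1 - g_t) * u_t, with input-dependent g_t.
        g = torch.sigmoid(self.gate(selected))        # (B, k, C)
        u = self.update(selected)
        h = torch.zeros(B, C, device=x.device)
        outs = []
        for t in range(k):
            h = g[:, t] * h + (1.0 - g[:, t]) * u[:, t]
            outs.append(h)
        refined = torch.stack(outs, dim=1)            # (B, k, C)

        # 3) Unselected tokens receive only a pooled global summary;
        #    refined foreground tokens are scattered back in place.
        ctx = self.global_proj(x.mean(dim=1, keepdim=True))  # (B, 1, C)
        out = x + ctx
        return out.scatter(1, idx, refined + selected)


# Usage: a 64x64 token grid (4096 tokens) with 256-dim embeddings.
block = DynamicRegionBlock(dim=256, keep_ratio=0.25)
tokens = torch.randn(2, 4096, 256)
print(block(tokens).shape)  # torch.Size([2, 4096, 256])
```

The top-k routing reflects the abstract's observation that key foreground targets cover roughly 1% of pixels: most tokens need only coarse global context, so the sequential state-space computation is spent where fine detail matters.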
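The multi-instance learning head with meta-embedding representations can likewise be pictured as a set of learnable query vectors that attend over region-level instance features (the "bag") and pool them into bag-level predictions shared across tasks. This is a sketch under assumptions; MetaEmbeddingMIL, num_meta, and the attention pooling are hypothetical choices, not the published architecture.

```python
# Sketch of MIL with meta-embeddings: learnable queries pool a bag of
# region-level instance features into bag-level logits.
import torch
import torch.nn as nn


class MetaEmbeddingMIL(nn.Module):
    def __init__(self, dim: int, num_meta: int = 16, num_classes: int = 10):
        super().__init__()
        # Learnable meta-embeddings shared across images and tasks.
        self.meta = nn.Parameter(torch.randn(num_meta, dim) * 0.02)
        self.classifier = nn.Linear(num_meta * dim, num_classes)

    def forward(self, regions: torch.Tensor) -> torch.Tensor:
        # regions: (B, M, C) region-level instance features (the bag).
        # Cross-attention: meta-embeddings are queries, regions keys/values.
        attn = torch.einsum("qc,bmc->bqm", self.meta, regions)
        attn = attn.softmax(dim=-1)                      # (B, Q, M)
        pooled = torch.einsum("bqm,bmc->bqc", attn, regions)
        return self.classifier(pooled.flatten(1))        # bag-level logits


# Usage: a bag of 32 region features per image, 256-dim each.
head = MetaEmbeddingMIL(dim=256, num_meta=16, num_classes=10)
bag = torch.randn(4, 32, 256)
print(head(bag).shape)  # torch.Size([4, 10])
```

Because the meta-embeddings are shared parameters rather than per-task heads, the same pooled representation can feed multiple downstream tasks, matching the abstract's emphasis on cross-task knowledge transfer learned from million-scale region-level annotations.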