OSM-based Domain Adaptation for Remote Sensing VLMs
March 12, 2026
Authors: Stefan Maria Ailuro, Mario Markov, Mohammad Mahdi, Delyan Boychev, Luc Van Gool, Danda Pani Paudel
cs.AI
Abstract
Vision-Language Models (VLMs) adapted to remote sensing rely heavily on domain-specific image-text supervision, yet high-quality annotations for satellite and aerial imagery remain scarce and expensive to produce. Prevailing pseudo-labeling pipelines address this gap by distilling knowledge from large frontier models, but this dependence on large teachers is costly, limits scalability, and caps achievable performance at the ceiling of the teacher. We propose OSMDA: a self-contained domain adaptation framework that eliminates this dependency. Our key insight is that a capable base VLM can serve as its own annotation engine: by pairing aerial images with rendered OpenStreetMap (OSM) tiles, we leverage optical character recognition and chart comprehension capabilities of the model to generate captions enriched by OSM's vast auxiliary metadata. The model is then fine-tuned on the resulting corpus with satellite imagery alone, yielding OSMDA-VLM, a domain-adapted VLM that requires no manual labeling and no stronger external model. We conduct exhaustive evaluations spanning 10 benchmarks across image-text-to-text tasks and comparing against 9 competitive baselines. When equally mixed with real data, our method achieves state-of-the-art results, while being substantially cheaper to train than teacher-dependent alternatives. These results suggest that, given a strong foundation model, alignment with crowd-sourced geographic data is a practical and scalable path towards remote sensing domain adaptation. Dataset and model weights will be made publicly available.
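As an illustration of the image-tile pairing step, the sketch below finds the OSM tile that covers a given aerial image location, using the standard Web Mercator "slippy map" tile indexing scheme. It is a minimal sketch under assumptions: the abstract does not say how OSMDA renders or indexes its OSM tiles, and the function names `latlon_to_tile` and `tile_bounds` are hypothetical helpers introduced here for illustration only.

```python
import math


def latlon_to_tile(lat_deg, lon_deg, zoom):
    """Return the (x, y) index of the Web Mercator tile containing a point.

    Hypothetical helper: standard slippy-map tile math, not taken from the paper.
    """
    n = 2 ** zoom  # number of tiles along each axis at this zoom level
    lat_rad = math.radians(lat_deg)
    x = int((lon_deg + 180.0) / 360.0 * n)
    y = int((1.0 - math.asinh(math.tan(lat_rad)) / math.pi) / 2.0 * n)
    # Clamp so that points on the eastern or southern edge stay in range.
    x = min(max(x, 0), n - 1)
    y = min(max(y, 0), n - 1)
    return x, y


def tile_bounds(x, y, zoom):
    """Return (lat_north, lon_west, lat_south, lon_east) of a tile in degrees."""
    n = 2 ** zoom

    def corner(tx, ty):
        # Latitude and longitude of the north-west corner of tile (tx, ty).
        lon = tx / n * 360.0 - 180.0
        lat = math.degrees(math.atan(math.sinh(math.pi * (1.0 - 2.0 * ty / n))))
        return lat, lon

    lat_north, lon_west = corner(x, y)
    lat_south, lon_east = corner(x + 1, y + 1)
    return lat_north, lon_west, lat_south, lon_east


if __name__ == "__main__":
    # Example location and zoom level are arbitrary choices for illustration.
    lat, lon, zoom = 47.3769, 8.5417, 17
    x, y = latlon_to_tile(lat, lon, zoom)
    print("tile:", zoom, x, y)
    print("bounds:", tile_bounds(x, y, zoom))
```

An aerial crop whose footprint matches the returned bounds can then be paired with the rendered map tile of the same index, so that the map's labels and symbols describe the same ground area as the image.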