OSM-Based Domain Adaptation for Remote Sensing Vision-Language Models

Abstract

Vision-Language Models (VLMs) adapted to remote sensing rely heavily on domain-specific image-text supervision, yet high-quality annotations for satellite and aerial imagery remain scarce and expensive to produce. Prevailing pseudo-labeling pipelines address this gap by distilling knowledge from large frontier models, but this dependence on large teachers is costly, limits scalability, and caps achievable performance at the teacher's ceiling. We propose OSMDA, a self-contained domain adaptation framework that eliminates this dependency. Our key insight is that a capable base VLM can serve as its own annotation engine: by pairing aerial images with rendered OpenStreetMap (OSM) tiles, we leverage the model's optical character recognition and chart comprehension capabilities to generate captions enriched by OSM's vast auxiliary metadata. The model is then fine-tuned on the resulting corpus with satellite imagery alone, yielding OSMDA-VLM, a domain-adapted VLM that requires no manual labeling and no stronger external model. We evaluate on 10 benchmarks spanning image-text-to-text tasks and compare against 9 competitive baselines. When mixed in equal proportion with real data, our method achieves state-of-the-art results while being substantially cheaper to train than teacher-dependent alternatives. These results suggest that, given a strong foundation model, alignment with crowd-sourced geographic data is a practical and scalable path towards remote sensing domain adaptation. Dataset and model weights will be made publicly available.
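As an illustration of the image-tile pairing step, the sketch below shows one way a georeferenced aerial image could be matched to an OSM tile. The abstract does not specify the tiling scheme; this assumes the standard Web Mercator "slippy map" convention, and the function names (`latlon_to_tile`, `tile_to_latlon`, `tile_bounds`) are hypothetical, not taken from the OSMDA code.

```python
import math

# Web Mercator is only defined up to roughly +/-85.0511 degrees latitude.
MAX_LAT = 85.05112878


def latlon_to_tile(lat_deg: float, lon_deg: float, zoom: int) -> tuple[int, int]:
    """Return the (x, y) index of the slippy-map tile containing a point."""
    lat_deg = max(-MAX_LAT, min(MAX_LAT, lat_deg))
    n = 2 ** zoom  # number of tiles along each axis at this zoom level
    x = int((lon_deg + 180.0) / 360.0 * n)
    lat_rad = math.radians(lat_deg)
    y = int((1.0 - math.asinh(math.tan(lat_rad)) / math.pi) / 2.0 * n)
    # Clamp so points on the eastern/southern edge stay inside the grid.
    return min(n - 1, max(0, x)), min(n - 1, max(0, y))


def tile_to_latlon(x: int, y: int, zoom: int) -> tuple[float, float]:
    """Return (lat, lon) of the north-west corner of tile (x, y)."""
    n = 2 ** zoom
    lon_deg = x / n * 360.0 - 180.0
    lat_rad = math.atan(math.sinh(math.pi * (1.0 - 2.0 * y / n)))
    return math.degrees(lat_rad), lon_deg


def tile_bounds(x: int, y: int, zoom: int) -> tuple[float, float, float, float]:
    """Return (south, west, north, east) of a tile in degrees."""
    north, west = tile_to_latlon(x, y, zoom)
    south, east = tile_to_latlon(x + 1, y + 1, zoom)
    return south, west, north, east


if __name__ == "__main__":
    # Hypothetical image centre; the zoom level is an arbitrary choice.
    x, y = latlon_to_tile(48.8584, 2.2945, zoom=17)
    print("tile:", x, y, "bounds:", tile_bounds(x, y, 17))
```

With the tile bounds in hand, an aerial image can be cropped or resampled to the same footprint before the image and the rendered tile are shown to the model together. How the authors actually align, render, and prompt with the pair is not described in the abstract.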