SARChat-Bench-2M: A Multi-Task Vision-Language Benchmark for SAR Image Interpretation
February 12, 2025
Authors: Zhiming Ma, Xiayang Xiao, Sihao Dong, Peidong Wang, HaiPeng Wang, Qingyun Pan
cs.AI
Abstract
In the field of synthetic aperture radar (SAR) remote sensing image interpretation, although vision-language models (VLMs) have made remarkable progress in natural language processing and image understanding, their applications remain limited in specialized domains due to insufficient domain expertise. This paper proposes the first large-scale multimodal dialogue dataset for SAR images, named SARChat-2M, which contains approximately 2 million high-quality image-text pairs and encompasses diverse scenarios with detailed target annotations. The dataset not only supports several key tasks such as visual understanding and object detection, but also makes a distinctive contribution: this study develops a vision-language dataset and benchmark for the SAR domain, enabling the evaluation of VLMs' capabilities in SAR image interpretation and providing a paradigmatic framework for constructing multimodal datasets across various remote sensing vertical domains. Experiments on 16 mainstream VLMs fully verify the effectiveness of the dataset and establish the first multi-task dialogue benchmark in the SAR field. The project will be released at https://github.com/JimmyMa99/SARChat, aiming to promote the in-depth development and broad application of SAR vision-language models.