
GeoPixel: Pixel Grounding Large Multimodal Model in Remote Sensing

January 23, 2025
Authors: Akashah Shabbir, Mohammed Zumri, Mohammed Bennamoun, Fahad S. Khan, Salman Khan
cs.AI

Abstract

Recent advances in large multimodal models (LMMs) have recognized fine-grained grounding as an essential component of visual understanding and dialogue. However, the benefits of such representations in LMMs are limited to the natural image domain, and these models perform poorly on remote sensing (RS) imagery. The distinct overhead viewpoint, scale variation, and presence of small objects in high-resolution RS imagery present a unique challenge for region-level comprehension. Moreover, the development of grounded conversation capabilities for LMMs in RS is hindered by the lack of granular, RS domain-specific grounded data. To address these limitations, we propose GeoPixel, the first end-to-end high-resolution RS-LMM that supports pixel-level grounding. This capability enables fine-grained visual perception by generating interleaved masks in conversation. GeoPixel supports up to 4K HD resolution in any aspect ratio, making it well suited to high-precision RS image analysis. To support grounded conversation generation (GCG) in RS imagery, we curate a visually grounded dataset, GeoPixelD, through a semi-automated pipeline that uses set-of-marks prompting and spatial priors tailored to RS data to methodically control the data generation process. GeoPixel demonstrates superior performance in pixel-level comprehension, surpassing existing LMMs on both single-target and multi-target segmentation tasks. Our methodological ablation studies validate the effectiveness of each component of the overall architecture. Our code and data will be publicly released.