When Large Vision-Language Model Meets Large Remote Sensing Imagery: Coarse-to-Fine Text-Guided Token Pruning
March 10, 2025
作者: Junwei Luo, Yingying Zhang, Xue Yang, Kang Wu, Qi Zhu, Lei Liang, Jingdong Chen, Yansheng Li
cs.AI
Abstract
Efficient vision-language understanding of large Remote Sensing Images (RSIs)
is meaningful but challenging. Current Large Vision-Language Models (LVLMs)
typically employ limited pre-defined grids to process images, leading to
information loss when handling gigapixel RSIs. Conversely, using unlimited
grids significantly increases computational costs. To preserve image details
while reducing computational complexity, we propose a text-guided token pruning
method with Dynamic Image Pyramid (DIP) integration. Our method introduces: (i)
a Region Focus Module (RFM) that leverages text-aware region localization
capability to identify critical vision tokens, and (ii) a coarse-to-fine image
tile selection and vision token pruning strategy based on DIP, which is guided
by RFM outputs and avoids directly processing entire large images.
Additionally, existing benchmarks for evaluating LVLMs' perception ability on
large RSIs suffer from limited question diversity and constrained image sizes.
We construct a new benchmark named LRS-VQA, which contains 7,333 QA pairs
across 8 categories, with image lengths of up to 27,328 pixels. Trained on the
same data, our method outperforms existing high-resolution strategies on four
datasets. Moreover, compared to existing token reduction methods, our approach
demonstrates higher efficiency under high-resolution settings. Dataset and code
are available at https://github.com/VisionXLab/LRS-VQA.
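The coarse-to-fine idea in the abstract can be sketched as follows. This is an illustrative assumption, not the paper's implementation: the `relevance` scorer is a hypothetical stand-in for the text-guided Region Focus Module, and the pyramid depth, tile size, and keep count are arbitrary toy values. The point it shows is that tiles are only refined at a finer pyramid level if they survived selection at the coarser one, so the full-resolution image is never scanned exhaustively.

```python
import numpy as np

def downsample(img):
    """Naive 2x downsample by striding (stand-in for proper resizing)."""
    return img[::2, ::2]

def coarse_to_fine_select(image, relevance, tile=4, levels=3, keep=2):
    """Sketch of coarse-to-fine tile selection over an image pyramid.

    `relevance(patch)` is a hypothetical scorer standing in for the
    paper's text-guided Region Focus Module. Only tiles kept at a
    coarse level are expanded into candidates at the next finer level.
    Returns (row, col) tile indices at the finest level.
    """
    # Build the pyramid, ordered coarsest first.
    pyr = [image]
    for _ in range(levels - 1):
        pyr.append(downsample(pyr[-1]))
    pyr = pyr[::-1]

    # At the coarsest level, every tile is a candidate.
    h, w = pyr[0].shape
    candidates = [(r, c) for r in range(h // tile) for c in range(w // tile)]

    for lvl, img in enumerate(pyr):
        scored = []
        for (r, c) in candidates:
            patch = img[r * tile:(r + 1) * tile, c * tile:(c + 1) * tile]
            scored.append((relevance(patch), r, c))
        scored.sort(reverse=True)
        kept = [(r, c) for _, r, c in scored[:keep]]
        if lvl == len(pyr) - 1:
            return kept
        # Each kept tile maps to a 2x2 block of tiles one level finer.
        candidates = [(2 * r + dr, 2 * c + dc)
                      for (r, c) in kept for dr in (0, 1) for dc in (0, 1)]

rng = np.random.default_rng(0)
img = rng.random((32, 32))
img[24:32, 24:32] += 5.0  # a bright "query-relevant" region in the corner
kept = coarse_to_fine_select(img, relevance=lambda p: p.mean())
print(kept)  # both kept tiles fall inside the bright lower-right region
```

Note the cost: only 4 + 8 + 8 tiles are ever scored here, versus 64 tiles in an exhaustive scan of the finest level, and the gap widens rapidly with image size and pyramid depth.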