When Large Vision-Language Model Meets Large Remote Sensing Imagery: Coarse-to-Fine Text-Guided Token Pruning

Abstract

Efficient vision-language understanding of large Remote Sensing Images (RSIs) is meaningful but challenging. Current Large Vision-Language Models (LVLMs) typically process images with a limited set of pre-defined grids, which leads to information loss when handling gigapixel RSIs. Conversely, using unlimited grids significantly increases computational cost. To preserve image details while reducing computational complexity, we propose a text-guided token pruning method integrated with a Dynamic Image Pyramid (DIP). Our method introduces: (i) a Region Focus Module (RFM) that leverages text-aware region localization capability to identify critical vision tokens, and (ii) a coarse-to-fine image tile selection and vision token pruning strategy based on DIP, which is guided by RFM outputs and avoids directly processing the entire large image. Additionally, existing benchmarks for evaluating LVLMs' perception ability on large RSIs suffer from limited question diversity and constrained image sizes. We construct a new benchmark named LRS-VQA, which contains 7,333 QA pairs across 8 categories, with image lengths of up to 27,328 pixels. Our method outperforms existing high-resolution strategies on four datasets using the same data. Moreover, compared to existing token reduction methods, our approach demonstrates higher efficiency under high-resolution settings. The dataset and code are available at https://github.com/VisionXLab/LRS-VQA.
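The coarse-to-fine tile selection described above can be illustrated with a minimal sketch. This is not the authors' implementation; it only shows the general idea of refining the most relevant regions of an image pyramid instead of processing every tile at full resolution. Several things here are assumptions made for illustration: the pyramid is modelled as a quadtree in which each tile splits into four children, the `score` callable is a hypothetical stand-in for the text-aware relevance that the RFM would provide, and `toy_score` with its target point is invented test data. Token pruning inside the selected tiles is not shown.

```python
from typing import Callable, List, Tuple

# (pyramid level, row, col); level 0 is the coarsest view of the whole image.
Tile = Tuple[int, int, int]


def children(tile: Tile) -> List[Tile]:
    """Return the four tiles at the next finer level that cover `tile`."""
    level, row, col = tile
    return [(level + 1, 2 * row + dr, 2 * col + dc)
            for dr in (0, 1) for dc in (0, 1)]


def coarse_to_fine_select(score: Callable[[Tile], float],
                          num_levels: int,
                          keep: int) -> List[Tile]:
    """Keep the `keep` best-scoring tiles per level and refine only those."""
    candidates: List[Tile] = [(0, 0, 0)]  # start from the whole image
    for _ in range(num_levels - 1):
        expanded = [child for tile in candidates for child in children(tile)]
        expanded.sort(key=score, reverse=True)  # highest relevance first
        candidates = expanded[:keep]
    return candidates


def toy_score(tile: Tile) -> float:
    """Hypothetical relevance: negative squared distance from the tile
    centre to an assumed point of interest, in normalised coordinates."""
    level, row, col = tile
    size = 1.0 / (2 ** level)  # tile side length at this level
    cy, cx = (row + 0.5) * size, (col + 0.5) * size
    ty, tx = 0.8, 0.3  # assumed location that the question refers to
    return -((cy - ty) ** 2 + (cx - tx) ** 2)


selected = coarse_to_fine_select(toy_score, num_levels=4, keep=2)
```

With `num_levels=4` and `keep=2`, the sketch scores at most 8 tiles per level rather than all 64 tiles at the finest level, and the top-ranked entry of `selected` is the finest-level tile that contains the assumed target point. In a real system the relevance would come from the model rather than from a fixed point.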
