A Glimpse to Compress: Dynamic Visual Token Pruning for Large Vision-Language Models

Abstract

Visual token compression is critical for Large Vision-Language Models (LVLMs) to process high-resolution inputs efficiently. Existing methods typically adopt a fixed compression ratio and therefore cannot adapt to scenes of varying complexity; the resulting imprecise pruning often discards informative visual tokens and degrades model performance. To address this issue, we introduce GlimpsePrune, a dynamic pruning framework inspired by human cognition. It takes a data-driven "glimpse" and prunes irrelevant visual tokens in a single forward pass before answer generation. This approach prunes 92.6% of visual tokens while, on average, fully retaining baseline performance on free-form visual question answering (VQA) tasks. The reduced computational cost also enables more effective fine-tuning: an enhanced variant, GlimpsePrune+, achieves 110% of baseline performance while maintaining a similarly high pruning rate. Our work opens a new path toward building more powerful and efficient LVLMs.
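The contrast between a fixed compression ratio and an input-dependent one can be illustrated with a small sketch. This is not the GlimpsePrune implementation: the per-token relevance scores, the threshold value, and the function names below are all hypothetical, chosen only to show why a fixed budget keeps too many tokens in a simple scene and too few in a complex one.

```python
from typing import List

# Hypothetical per-token relevance scores (assumed values, not from the paper).
SIMPLE_SCENE = [0.9, 0.05, 0.02, 0.01, 0.03, 0.04, 0.02, 0.01]
COMPLEX_SCENE = [0.8, 0.7, 0.6, 0.1, 0.9, 0.05, 0.75, 0.2]


def prune_fixed_ratio(scores: List[float], keep_ratio: float) -> List[int]:
    """Keep the same fraction of tokens for every input, ranked by score."""
    k = max(1, round(len(scores) * keep_ratio))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


def prune_dynamic(scores: List[float], threshold: float) -> List[int]:
    """Keep every token scoring above the threshold, so the count varies per input."""
    kept = [i for i, s in enumerate(scores) if s > threshold]
    if not kept:
        # Fall back to the single best token so at least one visual token remains.
        kept = [max(range(len(scores)), key=lambda i: scores[i])]
    return kept


def pruning_rate(num_tokens: int, num_kept: int) -> float:
    """Fraction of visual tokens removed."""
    return 1.0 - num_kept / num_tokens


if __name__ == "__main__":
    for name, scene in (("simple", SIMPLE_SCENE), ("complex", COMPLEX_SCENE)):
        fixed = prune_fixed_ratio(scene, keep_ratio=0.25)
        dynamic = prune_dynamic(scene, threshold=0.5)
        print(name, "fixed:", fixed, "dynamic:", dynamic,
              "dynamic pruning rate:", pruning_rate(len(scene), len(dynamic)))
```

With these assumed scores, the fixed 25% budget keeps two tokens in both scenes, while the threshold rule keeps one token in the simple scene and five in the complex one. For scale, a 92.6% pruning rate corresponds to keeping roughly 74 of every 1,000 visual tokens.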

