A Glimpse to Compress: Dynamic Visual Token Pruning for Large Vision-Language Models

August 3, 2025
Authors: Quan-Sheng Zeng, Yunheng Li, Qilong Wang, Peng-Tao Jiang, Zuxuan Wu, Ming-Ming Cheng, Qibin Hou
cs.AI

Abstract

Visual token compression is critical for Large Vision-Language Models (LVLMs) to efficiently process high-resolution inputs. Existing methods that typically adopt fixed compression ratios cannot adapt to scenes of varying complexity, often causing imprecise pruning that discards informative visual tokens and results in degraded model performance. To address this issue, we introduce a dynamic pruning framework, GlimpsePrune, inspired by human cognition. It takes a data-driven "glimpse" and prunes irrelevant visual tokens in a single forward pass before answer generation. This approach prunes 92.6% of visual tokens while on average fully retaining the baseline performance on free-form VQA tasks. The reduced computational cost also enables more effective fine-tuning: an enhanced GlimpsePrune+ achieves 110% of the baseline performance while maintaining a similarly high pruning rate. Our work paves a new way for building more powerful and efficient LVLMs.
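
The abstract describes the mechanism only at a high level. The sketch below illustrates what query-conditioned dynamic token pruning can look like in general: score each visual token's relevance to the text prompt in one pass, then keep a variable number of tokens per image instead of a fixed ratio. The function name `glimpse_prune`, the threshold `tau`, and the attention-style scoring rule are illustrative assumptions, not the paper's released implementation.

```python
# Minimal sketch of dynamic (per-image) visual token pruning.
# NOT the authors' code -- names and the scoring rule are assumptions.
import torch

def glimpse_prune(visual_tokens: torch.Tensor,   # (N, d) image patch embeddings
                  query_tokens: torch.Tensor,    # (M, d) text prompt embeddings
                  tau: float = 0.5) -> torch.Tensor:
    """Return the subset of visual tokens judged relevant to the query."""
    # One cheap "glimpse": scaled dot-product scores between prompt and patch tokens.
    scores = torch.softmax(
        query_tokens @ visual_tokens.T / visual_tokens.shape[-1] ** 0.5, dim=-1)
    # Best score any prompt token assigns to each patch.
    relevance = scores.max(dim=0).values
    # Dynamic cutoff relative to the per-image maximum: simple scenes keep
    # few tokens, cluttered scenes keep more (no fixed compression ratio).
    keep = relevance >= tau * relevance.max()
    return visual_tokens[keep]

# Example: 576 patch tokens, 32 prompt tokens, 1024-d embeddings.
pruned = glimpse_prune(torch.randn(576, 1024), torch.randn(32, 1024))
print(f"kept {pruned.shape[0]} / 576 visual tokens")
```

The key design point the abstract emphasizes is that the number of retained tokens is decided by the data in a single forward pass, rather than by a compression ratio fixed in advance.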