
A Glimpse to Compress: Dynamic Visual Token Pruning for Large Vision-Language Models

August 3, 2025
作者: Quan-Sheng Zeng, Yunheng Li, Qilong Wang, Peng-Tao Jiang, Zuxuan Wu, Ming-Ming Cheng, Qibin Hou
cs.AI

Abstract

Visual token compression is critical for Large Vision-Language Models (LVLMs) to efficiently process high-resolution inputs. Existing methods typically adopt fixed compression ratios and cannot adapt to scenes of varying complexity, often causing imprecise pruning that discards informative visual tokens and degrades model performance. To address this issue, we introduce GlimpsePrune, a dynamic pruning framework inspired by human cognition. It takes a data-driven "glimpse" and prunes irrelevant visual tokens in a single forward pass before answer generation. This approach prunes 92.6% of visual tokens while, on average, fully retaining baseline performance on free-form VQA tasks. The reduced computational cost also enables more effective fine-tuning: an enhanced variant, GlimpsePrune+, achieves 110% of the baseline performance while maintaining a similarly high pruning rate. Our work paves the way for building more powerful and efficient LVLMs.
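To illustrate the core idea of input-adaptive pruning described above, the following is a minimal sketch (not the paper's actual GlimpsePrune architecture): tokens are scored for relevance to the query, and a threshold relative to the strongest token decides how many survive, so the kept count adapts to scene complexity instead of following a fixed ratio. The scoring function and the `keep_frac_of_max` parameter are illustrative assumptions.

```python
import numpy as np

def dynamic_prune(visual_tokens, relevance_scores, keep_frac_of_max=0.5):
    """Keep visual tokens whose relevance clears a dynamic threshold.

    visual_tokens: (N, D) array of token embeddings.
    relevance_scores: (N,) query-conditioned scores (e.g. attention logits
    from a "glimpse" at the question) -- a stand-in here, not the paper's
    actual scoring mechanism.
    """
    # softmax over relevance scores (shifted for numerical stability)
    e = np.exp(relevance_scores - relevance_scores.max())
    probs = e / e.sum()
    # dynamic threshold relative to the strongest token: a complex scene
    # with many moderately relevant tokens keeps more of them, while a
    # simple scene concentrates mass on a few tokens and prunes the rest
    mask = probs >= keep_frac_of_max * probs.max()
    return visual_tokens[mask], mask

# toy example: 8 visual tokens, 4-dim embeddings; three clearly relevant
tokens = np.random.randn(8, 4)
scores = np.array([5.0, 0.1, 0.2, 4.8, 0.0, 0.1, 4.9, 0.2])
kept, mask = dynamic_prune(tokens, scores)
print(f"kept {kept.shape[0]} of {tokens.shape[0]} tokens")  # kept 3 of 8 tokens
```

Unlike a fixed-ratio top-k, the same threshold yields different survivor counts on different inputs, which is the adaptivity the abstract contrasts against fixed compression ratios.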