

AgilePruner: An Empirical Study of Attention and Diversity for Adaptive Visual Token Pruning in Large Vision-Language Models

March 1, 2026
Authors: Changwoo Baek, Jouwon Song, Sohyeon Kim, Kyeongbo Kong
cs.AI

Abstract

Large Vision-Language Models (LVLMs) have adopted visual token pruning strategies to mitigate the substantial computational overhead incurred by extensive visual token sequences. While prior works primarily focus on either attention-based or diversity-based pruning methods, an in-depth analysis of these approaches' characteristics and limitations remains largely missing. In this work, we conduct a thorough empirical analysis using effective rank (erank) as a measure of feature diversity, together with attention score entropy, to investigate visual token processing mechanisms and analyze the strengths and weaknesses of each approach. Our analysis reveals two insights: (1) Our erank-based quantitative analysis shows that many diversity-oriented pruning methods preserve substantially less feature diversity than intended; moreover, analysis using the CHAIR dataset reveals that the diversity they do retain is closely tied to increased hallucination frequency compared to attention-based pruning. (2) We further observe that attention-based approaches are more effective on simple images where visual evidence is concentrated, while diversity-based methods better handle complex images with distributed features. Building on these empirical insights, we show that incorporating image-aware adjustments into existing hybrid pruning strategies consistently improves their performance. We also provide a minimal instantiation of our empirical findings through a simple adaptive pruning mechanism, which achieves strong and reliable performance across standard benchmarks as well as hallucination-specific evaluations. Our project page is available at https://cvsp-lab.github.io/AgilePruner.
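To make the two diagnostic quantities concrete, the sketch below computes effective rank (in the standard Roy & Vetterli sense: the exponential of the Shannon entropy of the normalized singular-value distribution) and attention score entropy for a visual token feature matrix, then uses normalized entropy to switch between attention top-k and diversity-based (farthest-point) selection. The function names, the entropy threshold, and the farthest-point variant of diversity selection are illustrative assumptions, not the paper's actual mechanism.

```python
import numpy as np

def effective_rank(features: np.ndarray) -> float:
    """erank of a (n_tokens, dim) feature matrix: exp of the Shannon
    entropy of the normalized singular-value distribution."""
    s = np.linalg.svd(features, compute_uv=False)
    p = s / s.sum()
    p = p[p > 0]
    return float(np.exp(-(p * np.log(p)).sum()))

def attention_entropy(attn: np.ndarray) -> float:
    """Shannon entropy of an attention score distribution over tokens."""
    p = attn / attn.sum()
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())

def adaptive_prune(features: np.ndarray, attn: np.ndarray,
                   keep: int, entropy_threshold: float = 0.5) -> np.ndarray:
    """Hypothetical switch in the spirit of the paper's findings:
    concentrated attention (low normalized entropy) -> attention top-k;
    distributed attention -> diversity-oriented farthest-point sampling."""
    h = attention_entropy(attn) / np.log(len(attn))  # normalize to [0, 1]
    if h < entropy_threshold:
        idx = np.argsort(attn)[-keep:]               # top-k by attention
    else:
        idx = [int(np.argmax(attn))]                 # seed with top token
        d = np.linalg.norm(features - features[idx[0]], axis=1)
        while len(idx) < keep:                       # greedy max-min cover
            j = int(np.argmax(d))
            idx.append(j)
            d = np.minimum(d, np.linalg.norm(features - features[j], axis=1))
        idx = np.array(idx)
    return np.sort(idx)
```

For orthonormal token features the singular values are uniform, so erank equals the token count; a highly peaked attention map drives normalized entropy toward 0 and selects the attention top-k branch.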