AgilePruner: An Empirical Study of Attention and Diversity for Adaptive Visual Token Pruning in Large Vision-Language Models
March 1, 2026
Authors: Changwoo Baek, Jouwon Song, Sohyeon Kim, Kyeongbo Kong
cs.AI
Abstract
Large Vision-Language Models (LVLMs) have adopted visual token pruning strategies to mitigate the substantial computational overhead incurred by extensive visual token sequences. While prior works primarily focus on either attention-based or diversity-based pruning methods, an in-depth analysis of these approaches' characteristics and limitations is still lacking. In this work, we conduct a thorough empirical analysis using effective rank (erank) as a measure of feature diversity and attention score entropy to investigate visual token processing mechanisms, and we examine the strengths and weaknesses of each approach. Our analysis reveals two key insights: (1) Our erank-based quantitative analysis shows that many diversity-oriented pruning methods preserve substantially less feature diversity than intended; moreover, analysis on the CHAIR dataset reveals that the diversity they do retain is closely tied to increased hallucination frequency compared to attention-based pruning. (2) We further observe that attention-based approaches are more effective on simple images where visual evidence is concentrated, while diversity-based methods better handle complex images with distributed features. Building on these empirical insights, we show that incorporating image-aware adjustments into existing hybrid pruning strategies consistently improves their performance. We also provide a minimal instantiation of our empirical findings through a simple adaptive pruning mechanism, which achieves strong and reliable performance across standard benchmarks as well as hallucination-specific evaluations. Our project page is available at https://cvsp-lab.github.io/AgilePruner.
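For concreteness, the two diagnostics named in the abstract can be sketched as follows. This is a minimal illustration, not the paper's implementation: it uses the standard definition of effective rank (exponentiated Shannon entropy of the normalized singular values) and plain Shannon entropy over an attention score vector; all function and variable names here are our own.

```python
import numpy as np

def effective_rank(features: np.ndarray) -> float:
    """Effective rank (erank) of a feature matrix (tokens x dims).

    Standard definition: normalize the singular values into a
    probability distribution and exponentiate its Shannon entropy.
    A near rank-1 matrix yields erank close to 1; diverse features
    yield a larger erank.
    """
    s = np.linalg.svd(features, compute_uv=False)
    p = s / s.sum()
    p = p[p > 0]  # drop zero singular values before taking logs
    return float(np.exp(-(p * np.log(p)).sum()))

def attention_entropy(attn_scores: np.ndarray) -> float:
    """Shannon entropy of attention scores over visual tokens.

    Low entropy: attention concentrates on a few tokens (simple image
    with concentrated visual evidence). High entropy: attention is
    spread out (complex image with distributed features).
    """
    p = attn_scores / attn_scores.sum()
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())

# Rank-1 features collapse to erank ~ 1, while i.i.d. random
# features of the same shape retain much higher diversity.
rng = np.random.default_rng(0)
low_div = np.outer(rng.normal(size=64), rng.normal(size=32))   # rank-1
high_div = rng.normal(size=(64, 32))
print(effective_rank(low_div), effective_rank(high_div))
```

Under this reading, a diversity-oriented pruner should keep the erank of the retained tokens close to that of the full set, while attention entropy indicates whether an image's evidence is concentrated or distributed.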