AgilePruner: An Empirical Study of Attention and Diversity for Adaptive Visual Token Pruning in Large Vision-Language Models

Abstract

Large Vision-Language Models (LVLMs) adopt visual token pruning to reduce the substantial computational overhead incurred by long visual token sequences. Prior work has focused mainly on either attention-based or diversity-based pruning, but the characteristics and limitations of these approaches have received little in-depth analysis. In this work, we conduct a thorough empirical analysis of visual token processing, using effective rank (erank) as a measure of feature diversity together with attention score entropy, and we examine the strengths and weaknesses of each approach. The analysis yields two insights. (1) Our erank-based quantitative analysis shows that many diversity-oriented pruning methods preserve substantially less feature diversity than intended; moreover, analysis using the CHAIR dataset reveals that the diversity they do retain is closely tied to a higher hallucination frequency than under attention-based pruning. (2) Attention-based approaches are more effective on simple images where visual evidence is concentrated, while diversity-based methods better handle complex images with distributed features. Building on these empirical insights, we show that incorporating image-aware adjustments into existing hybrid pruning strategies consistently improves their performance. We also provide a minimal instantiation of our findings as a simple adaptive pruning mechanism, which achieves strong and reliable performance on standard benchmarks as well as hallucination-specific evaluations. Our project page is available at https://cvsp-lab.github.io/AgilePruner.
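The two diagnostics named in the abstract can be illustrated with a short sketch. The abstract does not give formulas, so this assumes the commonly used definition of effective rank (the exponential of the Shannon entropy of the normalized singular values) and the Shannon entropy of a normalized attention distribution. The function names, the `threshold` parameter, and the selection rule in `choose_pruning_mode` are hypothetical illustrations of an image-aware choice, not the paper's mechanism.

```python
import numpy as np


def effective_rank(features: np.ndarray, eps: float = 1e-12) -> float:
    """Effective rank of a (num_tokens, dim) feature matrix.

    Assumed definition: exp of the Shannon entropy of the singular
    values after normalizing them to sum to one.
    """
    singular_values = np.linalg.svd(features, compute_uv=False)
    p = singular_values / (singular_values.sum() + eps)
    p = p[p > eps]  # drop numerically zero entries before taking logs
    return float(np.exp(-(p * np.log(p)).sum()))


def attention_entropy(scores: np.ndarray, eps: float = 1e-12) -> float:
    """Shannon entropy (in nats) of non-negative attention scores."""
    p = scores / (scores.sum() + eps)
    p = p[p > eps]
    return float(-(p * np.log(p)).sum())


def choose_pruning_mode(scores: np.ndarray, threshold: float = 0.5) -> str:
    """Hypothetical rule, not taken from the paper.

    Low normalized entropy means attention is concentrated on few tokens
    (a "simple" image), so favor attention-based pruning; otherwise
    favor diversity-based pruning.
    """
    max_entropy = np.log(len(scores))  # entropy of a uniform distribution
    normalized = attention_entropy(scores) / max_entropy
    return "attention" if normalized < threshold else "diversity"
```

Under these assumptions, a 4x4 identity matrix has an effective rank of approximately 4, a rank-one matrix has an effective rank of approximately 1, and uniform attention over the visual tokens would select the diversity-based mode.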
