AgilePruner: An Empirical Study of Attention and Diversity for Adaptive Visual Token Pruning in Large Vision-Language Models

Abstract

Large Vision-Language Models (LVLMs) have adopted visual token pruning strategies to mitigate the substantial computational overhead incurred by long visual token sequences. While prior work focuses primarily on either attention-based or diversity-based pruning methods, the characteristics and limitations of these approaches have received little in-depth analysis. In this work, we conduct a thorough empirical analysis, using effective rank (erank) as a measure of feature diversity together with attention score entropy, to investigate visual token processing mechanisms and to analyze the strengths and weaknesses of each approach. Our analysis reveals two insights: (1) Our erank-based quantitative analysis shows that many diversity-oriented pruning methods preserve substantially less feature diversity than intended; moreover, analysis using the CHAIR dataset reveals that the diversity they do retain is closely tied to increased hallucination frequency compared to attention-based pruning. (2) We further observe that attention-based approaches are more effective on simple images where visual evidence is concentrated, while diversity-based methods better handle complex images with distributed features. Building on these empirical insights, we show that incorporating image-aware adjustments into existing hybrid pruning strategies consistently improves their performance. We also provide a minimal instantiation of our empirical findings through a simple adaptive pruning mechanism, which achieves strong and reliable performance across standard benchmarks as well as hallucination-specific evaluations. Our project page is available at https://cvsp-lab.github.io/AgilePruner.
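The two diagnostics named in the abstract can be illustrated with a short sketch. The abstract does not give formulas, so the code below assumes a common definition of effective rank (the exponential of the Shannon entropy of the normalized singular values) and the plain Shannon entropy of normalized attention scores. The function names `effective_rank` and `attention_entropy` are hypothetical and are not taken from the authors' code.

```python
import numpy as np


def effective_rank(features: np.ndarray, eps: float = 1e-12) -> float:
    """Effective rank of a (num_tokens, dim) feature matrix.

    Assumed definition (not confirmed by the abstract): exp of the Shannon
    entropy of the singular values after normalizing them to sum to one.
    """
    singular_values = np.linalg.svd(features, compute_uv=False)
    total = singular_values.sum()
    if total <= eps:
        return 0.0  # convention chosen here for an all-zero matrix
    p = singular_values / total
    p = p[p > eps]  # drop zeros so that 0 * log(0) is treated as 0
    return float(np.exp(-(p * np.log(p)).sum()))


def attention_entropy(scores: np.ndarray, eps: float = 1e-12) -> float:
    """Shannon entropy (in nats) of non-negative attention scores over tokens."""
    p = scores / max(float(scores.sum()), eps)
    p = p[p > eps]
    return float(-(p * np.log(p)).sum())


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    redundant = np.tile(rng.normal(size=(1, 64)), (16, 1))  # 16 identical tokens
    diverse = np.eye(16, 64)  # 16 mutually orthogonal tokens
    print("erank, redundant tokens:", effective_rank(redundant))
    print("erank, orthogonal tokens:", effective_rank(diverse))
    print("entropy, uniform attention:", attention_entropy(np.ones(8)))
    print("entropy, one-hot attention:", attention_entropy(np.eye(8)[0]))
```

Under these assumed definitions, a set of identical tokens has an effective rank near 1 and a set of mutually orthogonal tokens has an effective rank equal to the number of tokens. Low attention entropy corresponds to attention concentrated on a few tokens, and high entropy to attention spread across many tokens.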

