Efficient LLaMA-3.2-Vision by Trimming Cross-attended Visual Features
April 1, 2025
Authors: Jewon Lee, Ki-Ung Song, Seungmin Yang, Donguk Lim, Jaeyeon Kim, Wooksu Shin, Bo-Kyeong Kim, Yong Jae Lee, Tae-Ho Kim
cs.AI
Abstract
Visual token reduction lowers inference costs caused by extensive image
features in large vision-language models (LVLMs). Unlike relevant studies that
prune tokens in self-attention-only LVLMs, our work uniquely addresses
cross-attention-based models, which achieve superior performance. We identify
that the key-value (KV) cache size for image tokens in cross-attention layers
significantly exceeds that of text tokens in self-attention layers, posing a
major compute bottleneck. To mitigate this issue, we exploit the sparsity of
cross-attention maps to selectively prune redundant visual features. Our
Trimmed Llama effectively reduces KV cache demands without requiring additional
training. With visual features reduced by 50%, our model lowers
inference latency and memory usage while achieving benchmark parity.
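
The abstract states that sparsity in the cross-attention maps is used to select which visual features to keep, but gives no implementation details. A minimal PyTorch-style sketch of attention-guided KV trimming is shown below; the function name `trim_visual_kv`, the tensor shapes, and the scoring rule (mean attention received per visual token) are assumptions for illustration, not the authors' implementation.

```python
import torch

def trim_visual_kv(keys, values, attn_weights, keep_ratio=0.5):
    """Hypothetical sketch: keep the visual KV entries that receive the most
    cross-attention mass and drop the rest.

    keys, values:  (batch, num_heads, num_visual_tokens, head_dim)
    attn_weights:  (batch, num_heads, num_text_queries, num_visual_tokens),
                   softmax-normalized cross-attention map.
    """
    # Score each visual token by the attention it receives, averaged over
    # heads and text queries.
    token_scores = attn_weights.mean(dim=(1, 2))            # (batch, num_visual_tokens)

    num_keep = max(1, int(keys.size(2) * keep_ratio))
    top_idx = token_scores.topk(num_keep, dim=-1).indices   # (batch, num_keep)
    top_idx, _ = top_idx.sort(dim=-1)                       # keep original token order

    # Gather the selected tokens' keys/values for every head.
    idx = top_idx[:, None, :, None].expand(-1, keys.size(1), -1, keys.size(3))
    trimmed_keys = keys.gather(2, idx)
    trimmed_values = values.gather(2, idx)
    return trimmed_keys, trimmed_values, top_idx


if __name__ == "__main__":
    # Toy shapes: 1600 visual tokens, 32 text queries, 8 heads.
    B, H, N_img, N_txt, D = 1, 8, 1600, 32, 64
    keys = torch.randn(B, H, N_img, D)
    values = torch.randn(B, H, N_img, D)
    attn = torch.softmax(torch.randn(B, H, N_txt, N_img), dim=-1)

    k, v, kept = trim_visual_kv(keys, values, attn, keep_ratio=0.5)
    print(k.shape, v.shape)  # (1, 8, 800, 64) for both
```

The keep ratio of 0.5 mirrors the 50% feature reduction reported in the abstract; the actual selection criterion and where in the cross-attention stack Trimmed Llama applies it may differ.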