Efficient LLaMA-3.2-Vision by Trimming Cross-attended Visual Features
April 1, 2025
Authors: Jewon Lee, Ki-Ung Song, Seungmin Yang, Donguk Lim, Jaeyeon Kim, Wooksu Shin, Bo-Kyeong Kim, Yong Jae Lee, Tae-Ho Kim
cs.AI
Abstract
Visual token reduction lowers inference costs caused by extensive image
features in large vision-language models (LVLMs). Unlike relevant studies that
prune tokens in self-attention-only LVLMs, our work uniquely addresses
cross-attention-based models, which achieve superior performance. We identify
that the key-value (KV) cache size for image tokens in cross-attention layers
significantly exceeds that of text tokens in self-attention layers, posing a
major compute bottleneck. To mitigate this issue, we exploit the sparsity of
cross-attention maps to selectively prune redundant visual features. Our
Trimmed Llama effectively reduces KV cache demands without requiring additional
training. With visual features reduced by 50%, our model lowers
inference latency and memory usage while achieving benchmark parity.
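
The abstract states that sparsity in the cross-attention maps is used to select which visual features to keep, but gives no implementation details. A minimal PyTorch-style sketch of attention-guided KV trimming is shown below; the function name `trim_visual_kv`, the tensor shapes, and the scoring rule (mean attention received per visual token) are assumptions for illustration, not the authors' implementation.

```python
import torch

def trim_visual_kv(keys, values, attn_weights, keep_ratio=0.5):
    """Hypothetical sketch: keep the visual KV entries that receive the most
    cross-attention mass and drop the rest.

    keys, values:  (batch, num_heads, num_visual_tokens, head_dim)
    attn_weights:  (batch, num_heads, num_text_queries, num_visual_tokens),
                   softmax-normalized cross-attention map.
    """
    # Score each visual token by the attention it receives, averaged over
    # heads and text queries.
    token_scores = attn_weights.mean(dim=(1, 2))            # (batch, num_visual_tokens)

    num_keep = max(1, int(keys.size(2) * keep_ratio))
    top_idx = token_scores.topk(num_keep, dim=-1).indices   # (batch, num_keep)
    top_idx, _ = top_idx.sort(dim=-1)                       # keep original token order

    # Gather the selected tokens' keys/values for every head.
    idx = top_idx[:, None, :, None].expand(-1, keys.size(1), -1, keys.size(3))
    trimmed_keys = keys.gather(2, idx)
    trimmed_values = values.gather(2, idx)
    return trimmed_keys, trimmed_values, top_idx


if __name__ == "__main__":
    # Toy shapes: 1600 visual tokens, 32 text queries, 8 heads.
    B, H, N_img, N_txt, D = 1, 8, 1600, 32, 64
    keys = torch.randn(B, H, N_img, D)
    values = torch.randn(B, H, N_img, D)
    attn = torch.softmax(torch.randn(B, H, N_txt, N_img), dim=-1)

    k, v, kept = trim_visual_kv(keys, values, attn, keep_ratio=0.5)
    print(k.shape, v.shape)  # (1, 8, 800, 64) for both
```

The keep ratio of 0.5 mirrors the 50% feature reduction reported in the abstract; the actual selection criterion and where in the cross-attention stack Trimmed Llama applies it may differ.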