
Efficient LLaMA-3.2-Vision by Trimming Cross-attended Visual Features

April 1, 2025
作者: Jewon Lee, Ki-Ung Song, Seungmin Yang, Donguk Lim, Jaeyeon Kim, Wooksu Shin, Bo-Kyeong Kim, Yong Jae Lee, Tae-Ho Kim
cs.AI

Abstract

Visual token reduction lowers inference costs caused by extensive image features in large vision-language models (LVLMs). Unlike related studies that prune tokens in self-attention-only LVLMs, our work uniquely addresses cross-attention-based models, which achieve superior performance. We identify that the key-value (KV) cache size for image tokens in cross-attention layers significantly exceeds that of text tokens in self-attention layers, posing a major compute bottleneck. To mitigate this issue, we exploit the sparse nature of cross-attention maps to selectively prune redundant visual features. Our Trimmed Llama effectively reduces KV cache demands without requiring additional training. By operating on 50% fewer visual features, our model reduces inference latency and memory usage while maintaining benchmark parity.
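
Below is a minimal, illustrative PyTorch sketch of the idea summarized above: score each visual token by the cross-attention mass it receives, keep only the top fraction, and trim the corresponding KV-cache entries. The function name, tensor layouts, and the mean-over-heads-and-queries scoring rule are assumptions for illustration, not the paper's exact pruning criterion; the 0.5 keep ratio mirrors the 50% feature reduction reported in the abstract.

```python
import torch

def trim_visual_kv(cross_attn: torch.Tensor,
                   k_cache: torch.Tensor,
                   v_cache: torch.Tensor,
                   keep_ratio: float = 0.5):
    """Illustrative sketch (not the paper's exact method): prune visual KV entries
    using the sparsity of cross-attention maps.

    cross_attn: [batch, heads, query_len, num_visual_tokens] attention weights
    k_cache, v_cache: [batch, heads, num_visual_tokens, head_dim] cached keys/values
    Returns the trimmed key cache, value cache, and the kept token indices.
    """
    # Aggregate the attention mass each visual token receives
    # (mean over heads and text queries) as an importance score.
    importance = cross_attn.mean(dim=(1, 2))                # [batch, num_visual_tokens]

    # Keep the top-scoring fraction of visual tokens.
    num_keep = max(1, int(importance.shape[-1] * keep_ratio))
    keep_idx = importance.topk(num_keep, dim=-1).indices    # [batch, num_keep]
    keep_idx, _ = keep_idx.sort(dim=-1)                     # preserve original token order

    # Gather only the retained visual tokens from the KV cache.
    idx = keep_idx[:, None, :, None].expand(-1, k_cache.shape[1], -1, k_cache.shape[-1])
    k_trimmed = k_cache.gather(2, idx)
    v_trimmed = v_cache.gather(2, idx)
    return k_trimmed, v_trimmed, keep_idx
```

In an actual cross-attention-based LVLM, the trimmed keys and values would replace the full visual KV cache for subsequent decoding steps, which is where the latency and memory savings come from.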
