An Image is Worth 1/2 Tokens After Layer 2: Plug-and-Play Inference Acceleration for Large Vision-Language Models
March 11, 2024
Authors: Liang Chen, Haozhe Zhao, Tianyu Liu, Shuai Bai, Junyang Lin, Chang Zhou, Baobao Chang
cs.AI
Abstract
In this study, we identify an inefficient attention phenomenon in Large Vision-Language Models (LVLMs), notably within prominent models such as LLaVA-1.5, QwenVL-Chat, and Video-LLaVA. We find that attention computation over visual tokens is extremely inefficient in the deep layers of popular LVLMs, suggesting that visual tokens call for a sparser treatment than textual data. To this end, we introduce FastV, a versatile plug-and-play method designed to optimize computational efficiency by learning adaptive attention patterns in early layers and pruning visual tokens in subsequent ones. Our evaluations demonstrate that FastV dramatically reduces computational cost (e.g., a 45% reduction in FLOPs for LLaVA-1.5-13B) without sacrificing performance on a wide range of image and video understanding tasks. The computational efficiency and performance trade-off of FastV is highly customizable and Pareto-efficient: it can compress the FLOPs of a 13B-parameter model below the budget of a 7B-parameter model while still maintaining superior performance. We believe FastV has practical value for deploying LVLMs on edge devices and in commercial models. Code is released at https://github.com/pkunlp-icler/FastV.
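As a rough illustration of the pruning step described in the abstract (scoring visual tokens by the attention they receive after an early layer and keeping only a fraction of them for later layers), a minimal PyTorch sketch follows. The function name `prune_visual_tokens`, the tensor layout, and the default keep ratio of 0.5 are assumptions for illustration only and do not reproduce the released FastV implementation.

```python
# Minimal sketch of attention-based visual token pruning, assuming the decoder
# layer at the pruning point exposes its attention weights. All names below are
# hypothetical placeholders, not the FastV API.
import torch


def prune_visual_tokens(hidden_states: torch.Tensor,
                        attn_weights: torch.Tensor,
                        visual_start: int,
                        visual_end: int,
                        keep_ratio: float = 0.5) -> torch.Tensor:
    """Keep only the visual tokens that receive the most attention.

    hidden_states: (batch, seq_len, dim) activations entering the next layer.
    attn_weights:  (batch, heads, seq_len, seq_len) attention from the current layer.
    visual_start/visual_end: slice of the sequence occupied by image tokens.
    """
    # Attention received by each key position, averaged over heads and queries.
    received = attn_weights.mean(dim=1).mean(dim=1)        # (batch, seq_len)
    visual_scores = received[:, visual_start:visual_end]   # (batch, n_visual)

    n_visual = visual_end - visual_start
    n_keep = max(1, int(n_visual * keep_ratio))
    # Indices of the top-scoring visual tokens, re-sorted to preserve order.
    keep_idx = visual_scores.topk(n_keep, dim=-1).indices.sort(dim=-1).values

    pruned = []
    for b in range(hidden_states.size(0)):
        kept_visual = hidden_states[b, visual_start:visual_end][keep_idx[b]]
        pruned.append(torch.cat([hidden_states[b, :visual_start],
                                 kept_visual,
                                 hidden_states[b, visual_end:]], dim=0))
    return torch.stack(pruned, dim=0)
```

In this sketch, dropping half of the visual tokens after an early layer shortens the sequence seen by every subsequent layer, which is where the FLOP savings reported in the abstract would come from.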