

An Image is Worth 1/2 Tokens After Layer 2: Plug-and-Play Inference Acceleration for Large Vision-Language Models

March 11, 2024
Authors: Liang Chen, Haozhe Zhao, Tianyu Liu, Shuai Bai, Junyang Lin, Chang Zhou, Baobao Chang
cs.AI

Abstract

In this study, we identify the inefficient attention phenomena in Large Vision-Language Models (LVLMs), notably within prominent models like LLaVA-1.5, QwenVL-Chat and Video-LLaVA. We find that the attention computation over visual tokens is extremely inefficient in the deep layers of popular LVLMs, suggesting a need for a sparser approach compared to textual data handling. To this end, we introduce FastV, a versatile plug-and-play method designed to optimize computational efficiency by learning adaptive attention patterns in early layers and pruning visual tokens in subsequent ones. Our evaluations demonstrate FastV's ability to dramatically reduce computational costs (e.g., a 45% reduction in FLOPs for LLaVA-1.5-13B) without sacrificing performance in a wide range of image and video understanding tasks. The computational efficiency and performance trade-off of FastV is highly customizable and Pareto-efficient. It can compress the FLOPs of a 13B-parameter model below the budget of a 7B-parameter model while still maintaining superior performance. We believe FastV has practical value for the deployment of LVLMs in edge devices and commercial models. Code is released at https://github.com/pkunlp-icler/FastV.
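To make the pruning idea concrete, below is a minimal PyTorch sketch of the mechanism the abstract describes: after an early layer K, visual tokens are ranked by the average attention they receive, and only the top fraction is forwarded to the remaining layers. The function name, tensor layout, and the `keep_ratio`, `visual_start`, and `visual_end` parameters are illustrative assumptions rather than the authors' implementation; see the released code at https://github.com/pkunlp-icler/FastV for the actual method.

```python
# Illustrative sketch (not the FastV reference code): drop visual tokens that
# receive little attention at an early layer K, keeping only the top fraction
# for all deeper layers.
import torch


def prune_visual_tokens(hidden_states: torch.Tensor,
                        attn_weights: torch.Tensor,
                        visual_start: int,
                        visual_end: int,
                        keep_ratio: float = 0.5) -> torch.Tensor:
    """hidden_states: (batch, seq_len, dim) activations after layer K.
    attn_weights:  (batch, heads, seq_len, seq_len) attention maps of layer K.
    visual_start/visual_end: positions of the image tokens in the sequence.
    Returns hidden_states with low-attention visual tokens removed."""
    # Average attention each key position receives, over heads and query positions.
    received = attn_weights.mean(dim=1).mean(dim=1)            # (batch, seq_len)
    visual_scores = received[:, visual_start:visual_end]       # (batch, n_visual)

    n_visual = visual_end - visual_start
    n_keep = max(1, int(n_visual * keep_ratio))
    topk = visual_scores.topk(n_keep, dim=-1).indices          # (batch, n_keep)
    keep_idx = (topk + visual_start).sort(dim=-1).values       # keep original order

    pruned = []
    for b in range(hidden_states.size(0)):
        prefix = hidden_states[b, :visual_start]                # text/system prefix
        kept_visual = hidden_states[b, keep_idx[b]]             # retained image tokens
        suffix = hidden_states[b, visual_end:]                  # instruction/response tokens
        pruned.append(torch.cat([prefix, kept_visual, suffix], dim=0))
    return torch.stack(pruned, dim=0)
```

With `keep_ratio=0.5` this corresponds to the "1/2 tokens after layer 2" setting in the title: roughly half of the visual tokens are discarded for every layer beyond the chosen early layer, which is where the reported FLOP savings come from.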

