ZipVL: Efficient Large Vision-Language Models with Dynamic Token Sparsification and KV Cache Compression
October 11, 2024
Authors: Yefei He, Feng Chen, Jing Liu, Wenqi Shao, Hong Zhou, Kaipeng Zhang, Bohan Zhuang
cs.AI
Abstract
The efficiency of large vision-language models (LVLMs) is constrained by the
computational bottleneck of the attention mechanism during the prefill phase
and the memory bottleneck of fetching the key-value (KV) cache in the decoding
phase, particularly in scenarios involving high-resolution images or videos.
Visual content often exhibits substantial redundancy, resulting in highly
sparse attention maps within LVLMs. This sparsity can be leveraged to
accelerate attention computation or compress the KV cache through various
approaches. However, most studies focus on addressing only one of these
bottlenecks and do not adequately support dynamic adjustment of sparsity
concerning distinct layers or tasks. In this paper, we present ZipVL, an
efficient inference framework designed for LVLMs that resolves both computation
and memory bottlenecks through a dynamic ratio allocation strategy of important
tokens. This ratio is adaptively determined based on the layer-specific
distribution of attention scores, rather than fixed hyper-parameters, thereby
improving efficiency for less complex tasks while maintaining high performance
for more challenging ones. Then we select important tokens based on their
normalized attention scores and perform the attention mechanism solely on those
important tokens to accelerate the prefill phase. To mitigate the memory
bottleneck in the decoding phase, we apply mixed-precision quantization to the
KV cache, where high-bit quantization is used for caches of important tokens,
while low-bit quantization is applied to those of less importance. Our
experiments demonstrate that ZipVL can accelerate the prefill phase by 2.6×
and reduce GPU memory usage by 50.0%, with a minimal accuracy reduction of
only 0.2% on the Video-MME benchmark with the LongVA-7B model, effectively
enhancing the generation efficiency of LVLMs.
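To make the token-selection step concrete, below is a minimal PyTorch sketch of the dynamic ratio allocation described in the abstract, assuming token importance is measured by normalized attention mass aggregated over heads and the per-layer kept ratio falls out of a cumulative-mass threshold tau. The function name select_important_tokens, the input layout, and the head aggregation are illustrative assumptions, not the paper's verbatim procedure.

    import torch

    def select_important_tokens(attn_scores: torch.Tensor, tau: float = 0.95) -> torch.Tensor:
        # attn_scores: [num_heads, seq_len], attention mass received by each
        # token in one layer (an assumed input shape for illustration).
        scores = attn_scores.sum(dim=0)      # aggregate over heads
        scores = scores / scores.sum()       # normalize to a distribution
        # Sort tokens by importance and keep the smallest prefix whose
        # cumulative mass reaches tau; the kept ratio thus adapts to the
        # layer-specific score distribution instead of a fixed hyper-parameter.
        sorted_scores, order = scores.sort(descending=True)
        cum_mass = sorted_scores.cumsum(dim=0)
        k = min(int((cum_mass < tau).sum().item()) + 1, scores.numel())
        return order[:k]                     # indices of important tokens

During prefill, attention would then be computed only over these indices: a flat score distribution (a harder input) keeps more tokens, while a peaked one keeps fewer, matching the adaptive behavior the abstract describes.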
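The decoding-side idea can be sketched similarly. The snippet below assumes simple per-token asymmetric uniform quantization and illustrative bit-widths (8-bit for important tokens, 2-bit otherwise); the helper quantize_kv_mixed, its signature, and the quantizer itself are assumptions, as the abstract does not specify them.

    import torch

    def quantize_kv_mixed(kv: torch.Tensor, important_idx: torch.Tensor,
                          high_bits: int = 8, low_bits: int = 2):
        # kv: [seq_len, head_dim] cache tensor for one head (assumed layout).
        def quant(x: torch.Tensor, bits: int):
            qmax = 2 ** bits - 1
            lo = x.min(dim=-1, keepdim=True).values
            hi = x.max(dim=-1, keepdim=True).values
            scale = (hi - lo).clamp(min=1e-8) / qmax
            codes = ((x - lo) / scale).round().clamp(0, qmax).to(torch.uint8)
            return codes, scale, lo          # dequantize as: codes * scale + lo

        mask = torch.zeros(kv.shape[0], dtype=torch.bool, device=kv.device)
        mask[important_idx] = True
        # High-bit cache for important tokens, low-bit for the rest.
        return mask, quant(kv[mask], high_bits), quant(kv[~mask], low_bits)

Since most tokens in a sparse attention map are unimportant, storing them at low precision is where the bulk of the reported memory saving would come from.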