Layer-Condensed KV Cache for Efficient Inference of Large Language Models
May 17, 2024
Authors: Haoyi Wu, Kewei Tu
cs.AI
Abstract
Huge memory consumption has been a major bottleneck for deploying
high-throughput large language models in real-world applications. In addition
to the large number of parameters, the key-value (KV) cache for the attention
mechanism in the transformer architecture consumes a significant amount of
memory, especially when the number of layers is large for deep language models.
In this paper, we propose a novel method that only computes and caches the KVs
of a small number of layers, thus significantly saving memory consumption and
improving inference throughput. Our experiments on large language models show
that our method achieves up to 26× higher throughput than standard
transformers and competitive performance in language modeling and downstream
tasks. In addition, our method is orthogonal to existing transformer
memory-saving techniques, so it is straightforward to integrate them with our
model, achieving further improvement in inference efficiency. Our code is
available at https://github.com/whyNLP/LCKV.
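To make the core idea concrete, below is a minimal, hypothetical PyTorch sketch of a decoder stack in which every layer attends to one shared KV cache built from the final hidden states of past tokens, so cache memory grows with a single layer's KVs rather than with the number of layers. This is not the authors' implementation (see the linked repository for that): the class names `SharedKVAttention` and `TinyCondensedDecoder`, the hyperparameters, and the dummy "sink" cache entry are illustrative assumptions, and layer norms, MLP blocks, and the paper's training procedure are omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SharedKVAttention(nn.Module):
    """Attention layer whose keys/values are supplied externally (shared cache)."""

    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.n_heads = n_heads
        self.d_head = d_model // n_heads
        self.q_proj = nn.Linear(d_model, d_model)
        self.o_proj = nn.Linear(d_model, d_model)

    def forward(self, x, k, v):
        # x: (B, T, D); k, v: (B, S, D) drawn from the single shared cache
        B, T, _ = x.shape
        S = k.size(1)
        q = self.q_proj(x).view(B, T, self.n_heads, self.d_head).transpose(1, 2)
        k = k.reshape(B, S, self.n_heads, self.d_head).transpose(1, 2)
        v = v.reshape(B, S, self.n_heads, self.d_head).transpose(1, 2)
        out = F.scaled_dot_product_attention(q, k, v)
        return self.o_proj(out.transpose(1, 2).reshape(B, T, -1))


class TinyCondensedDecoder(nn.Module):
    """Toy decoder stack in which all layers read ONE KV cache, so cache memory
    scales with a single layer's KVs instead of the number of layers."""

    def __init__(self, d_model: int = 256, n_heads: int = 4, n_layers: int = 8):
        super().__init__()
        self.layers = nn.ModuleList(
            [SharedKVAttention(d_model, n_heads) for _ in range(n_layers)]
        )
        # One K/V projection pair for the whole stack (layer norms / MLPs omitted).
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)

    @torch.no_grad()
    def step(self, x_t, cache):
        """Decode one position; `cache` holds K/V of past tokens for ONE layer."""
        h = x_t
        for layer in self.layers:
            # Every layer attends to the same cached K/V of earlier tokens.
            h = h + layer(h, cache["k"], cache["v"])
        # Compute this token's K/V once, from its final hidden state, and append.
        cache["k"] = torch.cat([cache["k"], self.k_proj(h)], dim=1)
        cache["v"] = torch.cat([cache["v"], self.v_proj(h)], dim=1)
        return h, cache


if __name__ == "__main__":
    d_model = 256
    model = TinyCondensedDecoder(d_model=d_model)
    # Seed the cache with one dummy "sink" entry so attention over an empty
    # cache never occurs (a simplification specific to this sketch).
    cache = {"k": torch.zeros(1, 1, d_model), "v": torch.zeros(1, 1, d_model)}
    for _ in range(5):                       # decode 5 dummy positions
        x_t = torch.randn(1, 1, d_model)     # stand-in for an embedded token
        _, cache = model.step(x_t, cache)
    print(cache["k"].shape)  # torch.Size([1, 6, 256]): one cache shared by 8 layers
```

In a standard transformer the cache would hold `n_layers` separate K/V tensors; here a single tensor pair serves the whole stack, which is the memory saving the abstract describes, while the paper itself additionally addresses how to train such a model and how to handle the current token's own KVs.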