Layer-Condensed KV Cache for Efficient Inference of Large Language Models

May 17, 2024
Authors: Haoyi Wu, Kewei Tu
cs.AI

Abstract

Huge memory consumption has been a major bottleneck for deploying high-throughput large language models in real-world applications. In addition to the large number of parameters, the key-value (KV) cache for the attention mechanism in the transformer architecture consumes a significant amount of memory, especially when the number of layers is large for deep language models. In this paper, we propose a novel method that only computes and caches the KVs of a small number of layers, thus significantly saving memory consumption and improving inference throughput. Our experiments on large language models show that our method achieves up to 26× higher throughput than standard transformers and competitive performance in language modeling and downstream tasks. In addition, our method is orthogonal to existing transformer memory-saving techniques, so it is straightforward to integrate them with our model, achieving further improvement in inference efficiency. Our code is available at https://github.com/whyNLP/LCKV.
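
As a rough illustration of the memory argument in the abstract, the sketch below (not the authors' implementation from the LCKV repository; the function name, model dimensions, and the choice of 2 retained layers are hypothetical examples) shows how KV-cache size scales with the number of layers whose KVs are actually stored:

# A minimal back-of-the-envelope sketch (not from the paper or the LCKV repo):
# KV-cache memory grows linearly with the number of layers whose KVs are stored,
# so caching the KVs of only a few layers shrinks the cache proportionally.
# All names and model dimensions below are hypothetical example values.

def kv_cache_bytes(batch, seq_len, num_kv_layers, num_heads, head_dim, dtype_bytes=2):
    # K and V tensors, each of shape [batch, num_heads, seq_len, head_dim], per cached layer
    return 2 * num_kv_layers * batch * num_heads * seq_len * head_dim * dtype_bytes

# Hypothetical 32-layer decoder served in fp16
batch, seq_len, layers, heads, head_dim = 8, 4096, 32, 32, 128

standard  = kv_cache_bytes(batch, seq_len, num_kv_layers=layers, num_heads=heads, head_dim=head_dim)
condensed = kv_cache_bytes(batch, seq_len, num_kv_layers=2,      num_heads=heads, head_dim=head_dim)

print(f"standard cache : {standard / 2**30:.1f} GiB")   # 16.0 GiB
print(f"condensed cache: {condensed / 2**30:.1f} GiB")  # 1.0 GiB
print(f"reduction      : {standard / condensed:.0f}x")  # 16x

The cache memory freed this way can be spent on larger serving batches, which is where the throughput improvement claimed in the abstract comes from.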
