For efficient inference of large language models, we propose the Layer-Condensed KV Cache.
Layer-Condensed KV Cache for Efficient Inference of Large Language Models
May 17, 2024
Authors: Haoyi Wu, Kewei Tu
cs.AI
Abstract
Huge memory consumption has been a major bottleneck for deploying
high-throughput large language models in real-world applications. In addition
to the large number of parameters, the key-value (KV) cache for the attention
mechanism in the transformer architecture consumes a significant amount of
memory, especially when the number of layers is large for deep language models.
In this paper, we propose a novel method that only computes and caches the KVs
of a small number of layers, thus significantly saving memory consumption and
improving inference throughput. Our experiments on large language models show
that our method achieves up to 26× higher throughput than standard
transformers and competitive performance in language modeling and downstream
tasks. In addition, our method is orthogonal to existing transformer
memory-saving techniques, so it is straightforward to integrate them with our
model, achieving further improvement in inference efficiency. Our code is
available at https://github.com/whyNLP/LCKV.
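
To make the memory argument concrete, the back-of-the-envelope sketch below (not taken from the paper) shows that KV-cache size grows linearly with the number of layers whose keys and values are retained, so caching the KVs of only a few layers shrinks the cache roughly by the layer factor. The model configuration, batch size, and context length are illustrative assumptions.

# Back-of-the-envelope sketch (illustrative, not from the paper): the KV cache
# stores one key and one value vector per token for every cached layer, so its
# size is linear in the number of cached layers.
def kv_cache_bytes(cached_layers: int, batch: int, seq_len: int,
                   hidden_size: int, bytes_per_elem: int = 2) -> int:
    """Total KV-cache size: 2 tensors (K and V) per cached layer per token."""
    return 2 * cached_layers * batch * seq_len * hidden_size * bytes_per_elem

# Hypothetical 32-layer model with hidden size 4096, batch 8, 4096-token context, fp16.
full = kv_cache_bytes(cached_layers=32, batch=8, seq_len=4096, hidden_size=4096)
condensed = kv_cache_bytes(cached_layers=2, batch=8, seq_len=4096, hidden_size=4096)
print(f"KVs cached for all 32 layers: {full / 2**30:.1f} GiB")       # 16.0 GiB
print(f"KVs cached for 2 layers:      {condensed / 2**30:.1f} GiB")  # 1.0 GiB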