For efficient inference of large language models, we propose the Layer-Condensed KV Cache.
Layer-Condensed KV Cache for Efficient Inference of Large Language Models
May 17, 2024
Authors: Haoyi Wu, Kewei Tu
cs.AI
Abstract
Huge memory consumption has been a major bottleneck for deploying
high-throughput large language models in real-world applications. In addition
to the large number of parameters, the key-value (KV) cache for the attention
mechanism in the transformer architecture consumes a significant amount of
memory, especially when the number of layers is large for deep language models.
In this paper, we propose a novel method that only computes and caches the KVs
of a small number of layers, thus significantly saving memory consumption and
improving inference throughput. Our experiments on large language models show
that our method achieves up to 26× higher throughput than standard
transformers and competitive performance in language modeling and downstream
tasks. In addition, our method is orthogonal to existing transformer
memory-saving techniques, so it is straightforward to integrate them with our
model, achieving further improvement in inference efficiency. Our code is
available at https://github.com/whyNLP/LCKV.
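
To make the memory argument concrete, the back-of-the-envelope sketch below (not taken from the paper) shows that KV-cache size grows linearly with the number of layers whose keys and values are retained, so caching the KVs of only a few layers shrinks the cache roughly by the layer factor. The model configuration, batch size, and context length are illustrative assumptions.

# Back-of-the-envelope sketch (illustrative, not from the paper): the KV cache
# stores one key and one value vector per token for every cached layer, so its
# size is linear in the number of cached layers.
def kv_cache_bytes(cached_layers: int, batch: int, seq_len: int,
                   hidden_size: int, bytes_per_elem: int = 2) -> int:
    """Total KV-cache size: 2 tensors (K and V) per cached layer per token."""
    return 2 * cached_layers * batch * seq_len * hidden_size * bytes_per_elem

# Hypothetical 32-layer model with hidden size 4096, batch 8, 4096-token context, fp16.
full = kv_cache_bytes(cached_layers=32, batch=8, seq_len=4096, hidden_size=4096)
condensed = kv_cache_bytes(cached_layers=2, batch=8, seq_len=4096, hidden_size=4096)
print(f"KVs cached for all 32 layers: {full / 2**30:.1f} GiB")       # 16.0 GiB
print(f"KVs cached for 2 layers:      {condensed / 2**30:.1f} GiB")  # 1.0 GiB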