Layer-Condensed KV Cache for Efficient Inference of Large Language Models

Abstract

Huge memory consumption has been a major bottleneck for deploying high-throughput large language models in real-world applications. In addition to the large number of parameters, the key-value (KV) cache for the attention mechanism in the transformer architecture consumes a significant amount of memory, especially in deep language models with many layers. In this paper, we propose a novel method that computes and caches the KVs of only a small number of layers, significantly reducing memory consumption and improving inference throughput. Our experiments on large language models show that our method achieves up to 26× higher throughput than standard transformers, with competitive performance in language modeling and downstream tasks. In addition, our method is orthogonal to existing transformer memory-saving techniques, so it is straightforward to integrate them with our model for further gains in inference efficiency. Our code is available at https://github.com/whyNLP/LCKV.
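To give a rough sense of why the number of cached layers matters, the sketch below estimates KV cache size with the standard accounting of one key tensor and one value tensor per cached layer. The model shape (32 layers, 32 KV heads, head dimension 128, fp16) and the choice of 2 cached layers are hypothetical values picked for illustration; they are not taken from the paper, and `kv_cache_bytes` is an illustrative helper, not part of the released code.

```python
def kv_cache_bytes(cached_layers, batch_size, seq_len,
                   num_kv_heads, head_dim, bytes_per_elem=2):
    """Bytes needed to store keys and values for `cached_layers` layers."""
    # Factor 2: one tensor for keys and one for values in each cached layer.
    return (2 * cached_layers * batch_size * seq_len
            * num_kv_heads * head_dim * bytes_per_elem)


# Hypothetical model shape (an assumption, not from the paper).
shape = dict(batch_size=1, seq_len=2048, num_kv_heads=32, head_dim=128)

full = kv_cache_bytes(cached_layers=32, **shape)      # every layer cached
condensed = kv_cache_bytes(cached_layers=2, **shape)  # only 2 layers cached

print(f"all layers cached: {full / 2**20:.0f} MiB")
print(f"2 layers cached:   {condensed / 2**20:.0f} MiB")
print(f"reduction factor:  {full // condensed}x")
```

Because the cache grows linearly with the number of cached layers, caching fewer layers shrinks it proportionally. The freed memory can then go toward larger batches, which is one way a smaller cache can translate into higher throughput.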
