Reducing Transformer Key-Value Cache Size with Cross-Layer Attention

Abstract

Key-value (KV) caching plays an essential role in accelerating decoding for transformer-based autoregressive large language models (LLMs). However, the amount of memory required to store the KV cache can become prohibitive at long sequence lengths and large batch sizes. Since the invention of the transformer, two of the most effective interventions discovered for reducing the size of the KV cache have been Multi-Query Attention (MQA) and its generalization, Grouped-Query Attention (GQA). MQA and GQA both modify the design of the attention block so that multiple query heads can share a single key/value head, reducing the number of distinct key/value heads by a large factor while only minimally degrading accuracy. In this paper, we show that it is possible to take Multi-Query Attention a step further by also sharing key and value heads between adjacent layers, yielding a new attention design we call Cross-Layer Attention (CLA). With CLA, we find that it is possible to reduce the size of the KV cache by another 2x while maintaining nearly the same accuracy as unmodified MQA. In experiments training 1B- and 3B-parameter models from scratch, we demonstrate that CLA provides a Pareto improvement over the memory/accuracy tradeoffs that are possible with traditional MQA, enabling inference with longer sequence lengths and larger batch sizes than would otherwise be possible.
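To make the memory argument concrete, the following is a minimal sketch of the usual back-of-the-envelope KV cache size calculation. It is not code from the paper: the function name `kv_cache_bytes`, the parameter `cla_sharing_factor`, and the model configuration are all hypothetical, chosen only to illustrate how sharing key/value heads across query heads (MQA/GQA) and across adjacent layers (CLA) each shrink the cache. It assumes every layer that keeps its own cache stores one key and one value vector per KV head per token.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch_size,
                   bytes_per_elem=2, cla_sharing_factor=1):
    """Approximate KV cache size in bytes (illustrative assumption, not the paper's code).

    cla_sharing_factor is a hypothetical parameter: the number of adjacent
    layers that share one set of key/value activations. 1 means no
    cross-layer sharing; 2 means each pair of adjacent layers shares a cache.
    """
    if n_layers % cla_sharing_factor != 0:
        raise ValueError("n_layers must be divisible by cla_sharing_factor")
    # Only one layer per sharing group needs to store keys and values.
    layers_with_own_kv = n_layers // cla_sharing_factor
    # The leading factor of 2 accounts for storing both keys and values.
    return (2 * layers_with_own_kv * n_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_elem)


# Hypothetical configuration, not taken from the paper's experiments.
cfg = dict(n_layers=20, head_dim=128, seq_len=4096, batch_size=8)

mha = kv_cache_bytes(n_kv_heads=16, **cfg)   # one KV head per query head
gqa = kv_cache_bytes(n_kv_heads=4, **cfg)    # 4 query heads per KV head
mqa = kv_cache_bytes(n_kv_heads=1, **cfg)    # all query heads share one KV head
mqa_cla2 = kv_cache_bytes(n_kv_heads=1, cla_sharing_factor=2, **cfg)

for name, size in [("MHA", mha), ("GQA", gqa), ("MQA", mqa), ("MQA + CLA2", mqa_cla2)]:
    print(f"{name:>10}: {size / 2**20:.0f} MiB")
```

Under these assumptions, sharing across pairs of adjacent layers halves the number of layers that store keys and values, which is where the additional 2x reduction relative to plain MQA comes from.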
