MLKV: Multi-Layer Key-Value Heads for Memory Efficient Transformer Decoding
June 13, 2024
Authors: Zayd Muhammad Kawakibi Zuhri, Muhammad Farid Adilazuarda, Ayu Purwarianti, Alham Fikri Aji
cs.AI
Abstract
Auto-regressive inference of transformers benefits greatly from Key-Value (KV)
caching, but can lead to major memory bottlenecks as model size, batch size,
and sequence length grow at scale. We introduce Multi-Layer Key-Value (MLKV)
sharing, a novel approach extending KV sharing across transformer layers to
reduce memory usage beyond what was possible with Multi-Query Attention (MQA)
and Grouped-Query Attention (GQA). Evaluations on various NLP benchmarks and
inference metrics using uptrained Pythia-160M variants demonstrate that MLKV
significantly reduces memory usage with minimal performance loss, reducing KV
cache size down to a factor of 6x compared to MQA. These results highlight
MLKV's potential for efficient deployment of transformer models at scale. We
provide code at https://github.com/zaydzuhri/pythia-mlkv
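The core idea described in the abstract, sharing KV heads across transformer layers, can be sketched in a few lines. The PyTorch snippet below is a minimal illustration under assumed names and dimensions, not the authors' implementation (see the linked repository for that): a group of consecutive layers reuses the key/value tensors computed by one designated "owner" layer, so only the owners' KV tensors would need to be cached at decode time. The sketch keeps the full head count for simplicity; MLKV as described additionally pushes memory savings beyond MQA/GQA-style head grouping.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SharedKVAttention(nn.Module):
    """Self-attention block that either computes its own KV projections or
    reuses the KV tensors produced by an earlier layer in its sharing group.
    (Illustrative sketch; names and layout are assumptions, not the paper's code.)"""

    def __init__(self, d_model: int, n_heads: int, owns_kv: bool):
        super().__init__()
        self.n_heads = n_heads
        self.d_head = d_model // n_heads
        self.owns_kv = owns_kv
        self.q_proj = nn.Linear(d_model, d_model)
        self.o_proj = nn.Linear(d_model, d_model)
        if owns_kv:
            # Only "owner" layers carry KV parameters and contribute to the KV cache.
            self.k_proj = nn.Linear(d_model, d_model)
            self.v_proj = nn.Linear(d_model, d_model)

    def forward(self, x, shared_kv=None):
        B, T, _ = x.shape
        q = self.q_proj(x).view(B, T, self.n_heads, self.d_head).transpose(1, 2)
        if self.owns_kv:
            k = self.k_proj(x).view(B, T, self.n_heads, self.d_head).transpose(1, 2)
            v = self.v_proj(x).view(B, T, self.n_heads, self.d_head).transpose(1, 2)
            shared_kv = (k, v)  # only this layer's KV would be cached during decoding
        else:
            k, v = shared_kv  # borrow KV from the most recent owner layer
        out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        out = out.transpose(1, 2).reshape(B, T, -1)
        return self.o_proj(out), shared_kv


# Toy example: 12 layers with KV projections only in every 3rd layer, so during
# decoding only 4 of 12 layers would populate the KV cache (a 3x reduction here).
layers = nn.ModuleList(
    [SharedKVAttention(d_model=768, n_heads=12, owns_kv=(i % 3 == 0)) for i in range(12)]
)
x = torch.randn(2, 16, 768)
shared_kv = None
for layer in layers:
    x, shared_kv = layer(x, shared_kv)
```

In this setup the KV cache shrinks by the cross-layer sharing factor, which is the mechanism the abstract credits for reducing cache size well below what MQA alone achieves.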