MLKV: Multi-Layer Key-Value Heads for Memory Efficient Transformer Decoding
June 13, 2024
Authors: Zayd Muhammad Kawakibi Zuhri, Muhammad Farid Adilazuarda, Ayu Purwarianti, Alham Fikri Aji
cs.AI
Abstract
Auto-regressive inference of transformers benefits greatly from Key-Value (KV)
caching, but can lead to major memory bottlenecks as model size, batch size,
and sequence length grow at scale. We introduce Multi-Layer Key-Value (MLKV)
sharing, a novel approach extending KV sharing across transformer layers to
reduce memory usage beyond what was possible with Multi-Query Attention (MQA)
and Grouped-Query Attention (GQA). Evaluations on various NLP benchmarks and
inference metrics using uptrained Pythia-160M variants demonstrate that MLKV
significantly reduces memory usage with minimal performance loss, reducing KV
cache size down to a factor of 6x compared to MQA. These results highlight
MLKV's potential for efficient deployment of transformer models at scale. We
provide code at https://github.com/zaydzuhri/pythia-mlkv
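The core idea described in the abstract, sharing KV heads across transformer layers, can be sketched in a few lines. The PyTorch snippet below is a minimal illustration under assumed names and dimensions, not the authors' implementation (see the linked repository for that): a group of consecutive layers reuses the key/value tensors computed by one designated "owner" layer, so only the owners' KV tensors would need to be cached at decode time. The sketch keeps the full head count for simplicity; MLKV as described additionally pushes memory savings beyond MQA/GQA-style head grouping.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SharedKVAttention(nn.Module):
    """Self-attention block that either computes its own KV projections or
    reuses the KV tensors produced by an earlier layer in its sharing group.
    (Illustrative sketch; names and layout are assumptions, not the paper's code.)"""

    def __init__(self, d_model: int, n_heads: int, owns_kv: bool):
        super().__init__()
        self.n_heads = n_heads
        self.d_head = d_model // n_heads
        self.owns_kv = owns_kv
        self.q_proj = nn.Linear(d_model, d_model)
        self.o_proj = nn.Linear(d_model, d_model)
        if owns_kv:
            # Only "owner" layers carry KV parameters and contribute to the KV cache.
            self.k_proj = nn.Linear(d_model, d_model)
            self.v_proj = nn.Linear(d_model, d_model)

    def forward(self, x, shared_kv=None):
        B, T, _ = x.shape
        q = self.q_proj(x).view(B, T, self.n_heads, self.d_head).transpose(1, 2)
        if self.owns_kv:
            k = self.k_proj(x).view(B, T, self.n_heads, self.d_head).transpose(1, 2)
            v = self.v_proj(x).view(B, T, self.n_heads, self.d_head).transpose(1, 2)
            shared_kv = (k, v)  # only this layer's KV would be cached during decoding
        else:
            k, v = shared_kv  # borrow KV from the most recent owner layer
        out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        out = out.transpose(1, 2).reshape(B, T, -1)
        return self.o_proj(out), shared_kv


# Toy example: 12 layers with KV projections only in every 3rd layer, so during
# decoding only 4 of 12 layers would populate the KV cache (a 3x reduction here).
layers = nn.ModuleList(
    [SharedKVAttention(d_model=768, n_heads=12, owns_kv=(i % 3 == 0)) for i in range(12)]
)
x = torch.randn(2, 16, 768)
shared_kv = None
for layer in layers:
    x, shared_kv = layer(x, shared_kv)
```

In this setup the KV cache shrinks by the cross-layer sharing factor, which is the mechanism the abstract credits for reducing cache size well below what MQA alone achieves.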