

MLKV: Multi-Layer Key-Value Heads for Memory Efficient Transformer Decoding

June 13, 2024
Authors: Zayd Muhammad Kawakibi Zuhri, Muhammad Farid Adilazuarda, Ayu Purwarianti, Alham Fikri Aji
cs.AI

Abstract

Auto-regressive inference of transformers benefits greatly from Key-Value (KV) caching, but the cache can become a major memory bottleneck as model size, batch size, and sequence length grow at scale. We introduce Multi-Layer Key-Value (MLKV) sharing, a novel approach that extends KV sharing across transformer layers to reduce memory usage beyond what is possible with Multi-Query Attention (MQA) and Grouped-Query Attention (GQA). Evaluations on various NLP benchmarks and inference metrics using uptrained Pythia-160M variants demonstrate that MLKV significantly reduces memory usage with minimal performance loss, shrinking the KV cache by a further factor of 6x compared to MQA. These results highlight MLKV's potential for efficient deployment of transformer models at scale. We provide code at https://github.com/zaydzuhri/pythia-mlkv
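To make the idea concrete, below is a minimal PyTorch sketch of multi-layer KV sharing, written from the abstract alone. It is not the authors' implementation: the module names `SharedKV` and `MLKVAttention`, the first-touch caching scheme, and the layer-to-group assignment are illustrative assumptions. Groups of consecutive layers hold a reference to one shared KV projection, so keys and values are computed and cached once per group rather than once per layer.

```python
import torch
import torch.nn as nn

class SharedKV(nn.Module):
    """One KV projection pair, owned by one layer group and reused by its members."""
    def __init__(self, d_model, d_head):
        super().__init__()
        self.k_proj = nn.Linear(d_model, d_head, bias=False)
        self.v_proj = nn.Linear(d_model, d_head, bias=False)

class MLKVAttention(nn.Module):
    """Attention layer whose KV heads may be shared with other layers.

    Causal masking, positional encodings, and incremental decoding are
    omitted for brevity; this is a single-pass sketch, not the paper's code.
    """
    def __init__(self, d_model, n_heads, shared_kv):
        super().__init__()
        self.n_heads = n_heads
        self.d_head = d_model // n_heads
        self.q_proj = nn.Linear(d_model, d_model, bias=False)
        self.o_proj = nn.Linear(d_model, d_model, bias=False)
        self.shared_kv = shared_kv  # possibly the same object as in other layers

    def forward(self, x, kv_cache):
        B, T, _ = x.shape
        q = self.q_proj(x).view(B, T, self.n_heads, self.d_head).transpose(1, 2)
        # Compute K/V only once per shared group; later layers in the group
        # hit the cache, so it holds n_kv_groups entries, not n_layers.
        key = id(self.shared_kv)
        if key not in kv_cache:
            k = self.shared_kv.k_proj(x).unsqueeze(1)  # (B, 1, T, d_head)
            v = self.shared_kv.v_proj(x).unsqueeze(1)
            kv_cache[key] = (k, v)
        k, v = kv_cache[key]  # single KV head broadcasts over all query heads
        attn = torch.softmax(q @ k.transpose(-2, -1) / self.d_head ** 0.5, dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(B, T, -1)
        return self.o_proj(out)

# 12 layers sharing 2 KV groups: the cache is 6x smaller than an MQA-style
# one-KV-head-per-layer layout, matching the 6x figure in the abstract.
d_model, n_heads, n_layers, n_kv_groups = 64, 4, 12, 2
kv_groups = [SharedKV(d_model, d_model // n_heads) for _ in range(n_kv_groups)]
layers = [MLKVAttention(d_model, n_heads, kv_groups[i * n_kv_groups // n_layers])
          for i in range(n_layers)]

x, cache = torch.randn(2, 5, d_model), {}
for layer in layers:
    x = layer(x, cache)
print(len(cache))  # 2 cached KV pairs serving 12 layers
```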
