MLKV: Multi-Layer Key-Value Heads for Memory Efficient Transformer Decoding

Abstract

Auto-regressive inference of transformers benefits greatly from Key-Value (KV) caching, but the cache can become a major memory bottleneck as model size, batch size, and sequence length grow. We introduce Multi-Layer Key-Value (MLKV) sharing, a novel approach that extends KV sharing across transformer layers to reduce memory usage beyond what is possible with Multi-Query Attention (MQA) and Grouped-Query Attention (GQA). Evaluations on various NLP benchmarks and inference metrics using uptrained Pythia-160M variants demonstrate that MLKV significantly reduces memory usage with minimal performance loss, shrinking the KV cache by up to a factor of 6 compared to MQA. These results highlight MLKV's potential for efficient deployment of transformer models at scale. We provide code at https://github.com/zaydzuhri/pythia-mlkv.
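To make the cache-size comparison concrete, the following is a minimal sketch of the usual KV cache accounting, not code from the linked repository. It assumes a Pythia-160M-like shape (12 layers, 12 attention heads, head dimension 64) and 2-byte (fp16) cache entries; none of these numbers appear in the abstract. The function name `kv_cache_bytes` and the label `MLKV-2` (two KV heads shared across all layers) are hypothetical and used only for illustration.

```python
def kv_cache_bytes(total_kv_heads, head_dim, seq_len, batch_size, bytes_per_elem=2):
    """Bytes needed to cache keys and values for all KV heads in the model."""
    # The factor 2 accounts for storing both a key and a value vector per token.
    return 2 * total_kv_heads * head_dim * seq_len * batch_size * bytes_per_elem


# Assumed Pythia-160M-like shape (not stated in the abstract).
LAYERS, HEADS, HEAD_DIM = 12, 12, 64

# Total number of KV heads kept in the cache under each sharing scheme.
total_kv_heads = {
    "MHA": LAYERS * HEADS,  # every query head has its own KV head
    "MQA": LAYERS * 1,      # one KV head per layer
    "MLKV-2": 2,            # hypothetical: 2 KV heads shared across all layers
}

sizes = {
    name: kv_cache_bytes(n, HEAD_DIM, seq_len=2048, batch_size=8)
    for name, n in total_kv_heads.items()
}

# How much smaller the cache is with 2 shared KV heads than with MQA.
ratio_mqa_to_mlkv = sizes["MQA"] / sizes["MLKV-2"]

for name, size in sizes.items():
    print(f"{name}: {size / 2**20:.1f} MiB")
print(f"MQA / MLKV-2 cache ratio: {ratio_mqa_to_mlkv}")
```

Under these assumptions the cache size scales linearly with the total number of KV heads, so moving from one KV head per layer (12 in total) to 2 heads shared across layers gives the factor of 6 quoted above.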
