Reducing Transformer Key-Value Cache Size with Cross-Layer Attention

Abstract

Key-value (KV) caching plays an essential role in accelerating decoding for transformer-based autoregressive large language models (LLMs). However, the amount of memory required to store the KV cache can become prohibitive at long sequence lengths and large batch sizes. Since the invention of the transformer, two of the most effective interventions discovered for reducing the size of the KV cache have been Multi-Query Attention (MQA) and its generalization, Grouped-Query Attention (GQA). MQA and GQA both modify the design of the attention block so that multiple query heads can share a single key/value head, reducing the number of distinct key/value heads by a large factor while only minimally degrading accuracy. In this paper, we show that it is possible to take Multi-Query Attention a step further by also sharing key and value heads between adjacent layers, yielding a new attention design we call Cross-Layer Attention (CLA). With CLA, we find that it is possible to reduce the size of the KV cache by another 2x while maintaining nearly the same accuracy as unmodified MQA. In experiments training 1B- and 3B-parameter models from scratch, we demonstrate that CLA provides a Pareto improvement over the memory/accuracy tradeoffs that are possible with traditional MQA, enabling inference with longer sequence lengths and larger batch sizes than would otherwise be possible.
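The memory arithmetic behind these claims can be illustrated with a rough back-of-the-envelope sketch. The function name `kv_cache_bytes`, the model dimensions (24 layers, 16 query heads, head dimension 128), the fp16 storage, and the sharing factor of 2 are all illustrative assumptions, not values taken from the paper; the sharing factor of 2 is chosen only to mirror the "another 2x" reduction described above.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch_size,
                   bytes_per_elem=2, cla_sharing_factor=1):
    """Approximate KV cache size in bytes (hypothetical helper).

    One key vector and one value vector are stored per token, per KV head,
    for every layer that keeps its own KV activations. With cross-layer
    sharing, only n_layers / cla_sharing_factor layers store them.
    """
    if n_layers % cla_sharing_factor != 0:
        raise ValueError("n_layers must be divisible by cla_sharing_factor")
    kv_layers = n_layers // cla_sharing_factor
    # Leading factor of 2 accounts for storing both keys and values.
    return (2 * kv_layers * n_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_elem)


# Assumed configuration, for illustration only.
cfg = dict(n_layers=24, head_dim=128, seq_len=8192, batch_size=8)
n_query_heads = 16

mha = kv_cache_bytes(n_kv_heads=n_query_heads, **cfg)    # one KV head per query head
gqa = kv_cache_bytes(n_kv_heads=4, **cfg)                # 4 query heads per KV head
mqa = kv_cache_bytes(n_kv_heads=1, **cfg)                # all query heads share one KV head
cla = kv_cache_bytes(n_kv_heads=1, cla_sharing_factor=2, **cfg)  # MQA + adjacent-layer sharing

for name, size in [("MHA", mha), ("GQA", gqa), ("MQA", mqa), ("MQA + CLA", cla)]:
    print(f"{name:10s} {size / 2**30:.3f} GiB")
```

Under these assumptions the cache scales linearly with the number of distinct KV heads and with the number of layers that store KV activations, so sharing between pairs of adjacent layers halves the MQA footprint. The sketch says nothing about accuracy, which is what the experiments in the paper evaluate.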
