Multi-Head Low-Rank Attention

Abstract

Long-context inference in large language models is bottlenecked by Key-Value (KV) cache loading during the decoding stage: because generation is sequential, the KV cache must be transferred from off-chip High-Bandwidth Memory (HBM) to on-chip Static Random-Access Memory (SRAM) at every step. Multi-Head Latent Attention (MLA) significantly reduces the total KV cache size, but it suffers from a sharding bottleneck during distributed decoding with Tensor Parallelism (TP). Because its single latent head cannot be partitioned, each device must redundantly load the complete KV cache for every token, which incurs excessive memory traffic and diminishes TP benefits such as weight sharding. In this work, we propose Multi-Head Low-Rank Attention (MLRA), which enables partitionable latent states for efficient 4-way TP decoding. Extensive experiments show that MLRA achieves state-of-the-art perplexity and downstream task performance while delivering a 2.8× decoding speedup over MLA. Code is available at https://github.com/SongtaoLiu0823/MLRA. Pretrained weights, along with the training and evaluation data, are available at https://huggingface.co/Soughing/MLRA.
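The sharding bottleneck described above can be illustrated with a back-of-the-envelope calculation. The following is a minimal sketch, not the authors' implementation: the function name, the latent dimension of 512, the 2-byte element size, and the assumption that a partitionable latent state splits evenly across devices are all hypothetical choices made for illustration. It compares the bytes of latent KV cache that one device reads per layer per decoding step when the latent state is replicated on every device versus sharded across TP ranks.

```python
def per_device_kv_bytes(seq_len, latent_dim, tp_degree,
                        partitionable, bytes_per_elem=2):
    """Bytes of latent KV cache one device reads per layer per decoding step.

    Hypothetical cost model (not from the paper):
    - non-partitionable latent: every device loads the full latent per token
    - partitionable latent: each device loads an even 1/tp_degree slice
    """
    if tp_degree < 1:
        raise ValueError("tp_degree must be at least 1")
    if partitionable:
        if latent_dim % tp_degree != 0:
            raise ValueError("latent_dim must be divisible by tp_degree")
        dims_per_device = latent_dim // tp_degree
    else:
        # A single latent head cannot be split, so it is replicated.
        dims_per_device = latent_dim
    return seq_len * dims_per_device * bytes_per_elem


if __name__ == "__main__":
    # Illustrative values only: 1024 cached tokens, latent width 512, 4-way TP.
    replicated = per_device_kv_bytes(1024, 512, 4, partitionable=False)
    sharded = per_device_kv_bytes(1024, 512, 4, partitionable=True)
    print(f"replicated latent: {replicated} bytes per device")
    print(f"sharded latent:    {sharded} bytes per device")
    print(f"traffic ratio:     {replicated / sharded:.1f}x")
```

Under this idealized model, per-device KV traffic for a replicated latent does not shrink as the TP degree grows, whereas an evenly sharded latent reduces it in proportion to the number of devices. The measured 2.8× speedup reported in the abstract reflects the full system, not this simplified ratio.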