GQLA: Group-Query Latent Attention for Hardware-Adaptive Large Language Model Decoding

Abstract

Multi-head Latent Attention (MLA), the attention mechanism used in DeepSeek-V2/V3, jointly compresses keys and values into a low-rank latent and matches the H100 roofline almost perfectly. Its trained weights, however, expose only one decoding path, an absorbed MQA form. This ties efficient inference to H100-class compute-to-bandwidth ratios, rules out tensor parallelism along the head axis, and yields no Multi-Token Prediction (MTP) gain on commodity inference GPUs such as the export-restricted H20. We propose Group-Query Latent Attention (GQLA), a minimal modification of MLA whose trained weights expose two algebraically equivalent decoding paths over the same parameters: an MQA-absorb path identical to MLA's, and a GQA path with a per-group expanded cache. The runtime picks the path that matches the target hardware, with no retraining and no custom kernels, so a single set of GQLA weights meets the rooflines of both H100 (MQA-absorb, s_q=1) and H20 (GQA + MTP, s_q=2), while supporting up to 8-way zero-redundancy tensor parallelism on the GQA path. To avoid pretraining from scratch, we extend TransMLA into TransGQLA, which converts a pretrained GQA checkpoint into a GQLA model. On LLaMA-3-8B, TransGQLA compresses the per-token KV cache to 28.125% of the GQA baseline on the MQA-absorb path while structurally preserving GQA-level traffic on the per-group path.
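The 28.125% figure can be sanity-checked with a few lines of arithmetic. The sketch below uses the public LLaMA-3-8B attention shape (8 KV heads, head dimension 128) for the GQA baseline. The latent split of 512 compressed dimensions plus 64 decoupled RoPE dimensions is an assumption borrowed from DeepSeek-V2's MLA configuration; the abstract does not state it, but it is a split that reproduces the quoted ratio. The function names are illustrative only.

```python
from fractions import Fraction

# LLaMA-3-8B attention shape (public model config).
N_KV_HEADS = 8
HEAD_DIM = 128

# ASSUMPTION: latent width 512 plus 64 decoupled RoPE dims, as in
# DeepSeek-V2's MLA. The abstract does not give these values.
LATENT_DIM = 512
ROPE_DIM = 64


def gqa_cache_elems(n_kv_heads: int, head_dim: int) -> int:
    """Elements cached per token per layer under GQA: one K and one V per KV head."""
    return 2 * n_kv_heads * head_dim


def latent_cache_elems(latent_dim: int, rope_dim: int) -> int:
    """Elements cached per token per layer on an absorbed (MQA-style) latent path."""
    return latent_dim + rope_dim


baseline = gqa_cache_elems(N_KV_HEADS, HEAD_DIM)
latent = latent_cache_elems(LATENT_DIM, ROPE_DIM)
ratio = Fraction(latent, baseline)

print(f"GQA baseline: {baseline} elements per token per layer")
print(f"Latent cache: {latent} elements per token per layer")
print(f"Ratio: {ratio} = {float(ratio):.3%}")
```

Under these assumptions the baseline is 2048 elements and the latent cache is 576, a ratio of 9/32. The comparison counts elements rather than bytes, so it holds for any cache dtype shared by both paths.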