GQLA: Group-Query Latent Attention for Hardware-Adaptive Large Language Model Decoding
May 14, 2026
Author: Fanxu Meng
cs.AI
Abstract
Multi-head Latent Attention (MLA), the attention mechanism used in DeepSeek-V2/V3, jointly compresses keys and values into a low-rank latent and matches the H100 roofline almost perfectly. Its trained weights, however, expose only one decoding path, an absorbed MQA form. This ties efficient inference to H100-class compute-to-bandwidth ratios, forfeits tensor parallelism along the head axis, and yields no Multi-Token Prediction (MTP) gain on commodity inference GPUs such as the export-restricted H20. We propose Group-Query Latent Attention (GQLA), a minimal modification of MLA whose trained weights expose two algebraically equivalent decoding paths over the same parameters: an MQA-absorb path identical to MLA's, and a GQA path with a per-group expanded cache. The runtime picks the path that matches the target hardware, with no retraining and no custom kernels, so a single set of GQLA weights pins the rooflines of both H100 (MQA-absorb, s_q=1) and H20 (GQA + MTP, s_q=2), while supporting up to 8-way zero-redundancy tensor parallelism on the GQA path. To avoid pretraining from scratch, we extend TransMLA into TransGQLA, which converts a pretrained GQA checkpoint into a GQLA model; on LLaMA-3-8B, TransGQLA compresses the per-token KV cache to 28.125% of the GQA baseline on the MQA-absorb path while structurally preserving GQA-level traffic on the per-group path.
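
To make the "two algebraically equivalent decoding paths" concrete, the sketch below checks the generic weight-absorption identity for one query head in plain Python. It is an illustration under simplifying assumptions, not the paper's implementation: it ignores RoPE, head grouping, and MTP, and every name in it (`decode_both_paths`, `w_uk`, `w_uv`, `cache_fraction`) is hypothetical. The last helper shows one configuration that happens to reproduce the 28.125% figure: LLaMA-3-8B stores 8 KV heads of dimension 128, and a latent of 576 elements per token (512 + 64, the DeepSeek-style split, assumed here because the abstract does not state the latent width) gives 576 / 2048.

```python
import math
import random


def matvec(m, v):
    """Multiply matrix m (a list of rows) by column vector v."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def vecmat(v, m):
    """Multiply row vector v by matrix m (a list of rows)."""
    return [sum(v[i] * m[i][j] for i in range(len(v))) for j in range(len(m[0]))]


def softmax(xs):
    top = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - top) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def rand_matrix(rng, rows, cols):
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def decode_both_paths(seq_len=6, latent_dim=8, head_dim=4, seed=0):
    """Return one head's attention output via the expanded and absorbed paths."""
    rng = random.Random(seed)
    latent = rand_matrix(rng, seq_len, latent_dim)  # cached latent, one row per token
    w_uk = rand_matrix(rng, latent_dim, head_dim)   # key up-projection (toy weights)
    w_uv = rand_matrix(rng, latent_dim, head_dim)   # value up-projection (toy weights)
    q = [rng.gauss(0.0, 1.0) for _ in range(head_dim)]
    scale = 1.0 / math.sqrt(head_dim)

    # Expanded path: materialize per-token keys and values, then attend.
    keys = [vecmat(c, w_uk) for c in latent]
    values = [vecmat(c, w_uv) for c in latent]
    probs = softmax([scale * sum(k_i * q_i for k_i, q_i in zip(k, q)) for k in keys])
    out_expanded = vecmat(probs, values)

    # Absorbed path: fold w_uk into the query, attend over the latent cache,
    # then apply w_uv once to the attention-weighted latent.
    q_latent = matvec(w_uk, q)
    probs = softmax([scale * s for s in matvec(latent, q_latent)])
    out_absorbed = vecmat(vecmat(probs, latent), w_uv)
    return out_expanded, out_absorbed


def cache_fraction(latent_elems, n_kv_heads, head_dim):
    """Latent cache size relative to a GQA K+V cache, per token and per layer."""
    return latent_elems / (2 * n_kv_heads * head_dim)


if __name__ == "__main__":
    expanded, absorbed = decode_both_paths()
    print(all(math.isclose(a, b, rel_tol=1e-9, abs_tol=1e-9)
              for a, b in zip(expanded, absorbed)))
    print(cache_fraction(576, 8, 128))  # 0.28125 under the assumed 512 + 64 latent
```

Both paths read the same toy weights, so the choice between them changes only what is cached and how much arithmetic is done per byte read. That is the degree of freedom the abstract describes the runtime as selecting per target GPU.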