GQLA: Group-Query Latent Attention for Hardware-Adaptive Large Language Model Decoding

Abstract

Multi-head Latent Attention (MLA), the attention mechanism used in DeepSeek-V2/V3, jointly compresses keys and values into a low-rank latent and matches the H100 roofline almost perfectly. Its trained weights, however, expose only one decoding path, an absorbed MQA form. This ties efficient inference to H100-class compute-to-bandwidth ratios, forfeits tensor parallelism along the head axis, and yields no Multi-Token Prediction (MTP) gain on commodity inference GPUs such as the export-restricted H20. We propose Group-Query Latent Attention (GQLA), a minimal modification of MLA whose trained weights expose two algebraically equivalent decoding paths over the same parameters: an MQA-absorb path identical to MLA's, and a GQA path with a per-group expanded cache. The runtime picks the path that matches the target hardware, with no retraining and no custom kernels, so a single set of GQLA weights pins the rooflines of both H100 (MQA-absorb, s_q=1) and H20 (GQA + MTP, s_q=2) while supporting up to 8-way zero-redundancy tensor parallelism on the GQA path. To avoid pretraining from scratch, we extend TransMLA into TransGQLA, which converts a pretrained GQA checkpoint into a GQLA model; on LLaMA-3-8B, it compresses the per-token KV cache to 28.125% of the GQA baseline on the MQA-absorb path while structurally preserving GQA-level traffic on the per-group path.
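
The claim that two decoding paths are algebraically equivalent over the same parameters can be illustrated with a toy example. The sketch below is not the GQLA implementation: it assumes one key up-projection per group (the hypothetical `w_uk`), ignores RoPE, values, and softmax, and uses made-up toy sizes. It only shows that scoring a query against per-group expanded keys gives the same attention logits as folding the up-projection into the query and scoring against the shared latent.

```python
import random

# Hypothetical toy sizes, chosen only for illustration (not from the paper).
LATENT_DIM = 6   # width of the shared latent cached per token
HEAD_DIM = 4     # width of one expanded key
NUM_GROUPS = 2   # number of KV groups
SEQ_LEN = 5      # number of cached tokens

rng = random.Random(0)


def rand_matrix(rows, cols):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vec):
    return [sum(m * x for m, x in zip(row, vec)) for row in matrix]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def transpose(matrix):
    return [list(col) for col in zip(*matrix)]


# Assumed structure: one key up-projection per group, mapping latent -> key.
w_uk = [rand_matrix(HEAD_DIM, LATENT_DIM) for _ in range(NUM_GROUPS)]
latents = rand_matrix(SEQ_LEN, LATENT_DIM)   # one shared latent per cached token
queries = rand_matrix(NUM_GROUPS, HEAD_DIM)  # one query head per group


def scores_expanded():
    """GQA-style path: expand the latent into per-group keys, then score."""
    out = []
    for g in range(NUM_GROUPS):
        keys = [matvec(w_uk[g], c) for c in latents]  # per-group expanded cache
        out.append([dot(queries[g], k) for k in keys])
    return out


def scores_absorbed():
    """MQA-absorb path: fold the up-projection into the query, score on latents."""
    out = []
    for g in range(NUM_GROUPS):
        q_abs = matvec(transpose(w_uk[g]), queries[g])  # query moved to latent space
        out.append([dot(q_abs, c) for c in latents])
    return out


expanded = scores_expanded()
absorbed = scores_absorbed()
max_gap = max(
    abs(a - b)
    for row_a, row_b in zip(expanded, absorbed)
    for a, b in zip(row_a, row_b)
)
print(max_gap < 1e-12)  # the two paths agree up to floating-point rounding
```

The two paths differ only in what is cached and where the multiply happens: the absorbed path stores `LATENT_DIM` values per token and does more arithmetic per cached byte, while the expanded path stores `NUM_GROUPS * HEAD_DIM` key values per token and does less. As a rough check on the 28.125% figure, LLaMA-3-8B uses 8 KV heads of width 128, so GQA caches 2 x 8 x 128 = 2048 values per token per layer; a latent cache of 576 values (an assumed 512 + 64 split in the style of DeepSeek's MLA, not stated in this abstract) would give 576 / 2048 = 28.125%.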