GQLA: Group-Query Latent Attention for Hardware-Adaptive Large Language Model Decoding

Abstract

Multi-head Latent Attention (MLA), the attention mechanism used in DeepSeek-V2/V3, jointly compresses keys and values into a low-rank latent and matches the H100 roofline almost perfectly. Its trained weights, however, expose only one decoding path, an absorbed MQA form. This ties efficient inference to H100-class compute-to-bandwidth ratios, forfeits tensor parallelism along the head axis, and yields no Multi-Token Prediction (MTP) gain on commodity inference GPUs such as the export-restricted H20. We propose Group-Query Latent Attention (GQLA), a minimal modification of MLA whose trained weights expose two algebraically equivalent decoding paths over the same parameters: an MQA-absorb path identical to MLA's, and a GQA path with a per-group expanded cache. The runtime picks the path that matches the target hardware, with no retraining and no custom kernels, so a single set of GQLA weights sits at the roofline on both H100 (MQA-absorb, s_q = 1) and H20 (GQA + MTP, s_q = 2), while supporting up to 8-way zero-redundancy tensor parallelism on the GQA path. To avoid pretraining from scratch, we extend TransMLA into TransGQLA, which converts a pretrained GQA checkpoint into a GQLA model; on LLaMA-3-8B it compresses the per-token KV cache to 28.125% of the GQA baseline on the MQA-absorb path while structurally preserving GQA-level traffic on the per-group path.
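The cache ratio and the roofline claims above can be checked with a back-of-the-envelope calculation. The sketch below is illustrative only and rests on assumptions that the abstract does not state: a DeepSeek-style absorbed cache of 512 latent dimensions plus 64 decoupled-RoPE dimensions, 128 query heads for the intensity estimate, s_q read as the number of query positions per decoding step, and approximate public peak figures for dense BF16 throughput and memory bandwidth on H100 and H20. The LLaMA-3-8B values (8 KV heads, head dimension 128) come from its public configuration. All function names are hypothetical.

```python
from fractions import Fraction


def gqa_cache_elems(n_kv_heads: int, head_dim: int) -> int:
    """Per-token, per-layer KV-cache elements for GQA (keys plus values)."""
    return 2 * n_kv_heads * head_dim


def latent_cache_elems(latent_dim: int, rope_dim: int) -> int:
    """Per-token, per-layer cache elements for an absorbed latent path:
    one shared latent vector plus one shared decoupled-RoPE key."""
    return latent_dim + rope_dim


def absorbed_decode_intensity(n_heads: int, latent_dim: int, rope_dim: int,
                              s_q: int, bytes_per_elem: int = 2) -> float:
    """Approximate FLOPs per byte of cache read for absorbed-MQA decoding.

    Per cached token, each query head and query position spends
    (latent_dim + rope_dim) multiply-adds on the score and latent_dim
    multiply-adds on the weighted sum; the cache entry is read once and
    shared by all heads. Softmax and projection costs are ignored.
    """
    flops = 2 * (2 * latent_dim + rope_dim) * n_heads * s_q
    bytes_read = (latent_dim + rope_dim) * bytes_per_elem
    return flops / bytes_read


def ridge_point(peak_flops: float, mem_bandwidth: float) -> float:
    """FLOPs per byte at which a device moves from memory- to compute-bound."""
    return peak_flops / mem_bandwidth


# Cache ratio: LLaMA-3-8B GQA baseline vs. an ASSUMED 512 + 64 latent cache.
baseline = gqa_cache_elems(n_kv_heads=8, head_dim=128)
latent = latent_cache_elems(latent_dim=512, rope_dim=64)  # assumption
cache_ratio = Fraction(latent, baseline)

# Roofline: ASSUMED 128 heads and approximate hardware peak figures.
intensity_sq1 = absorbed_decode_intensity(128, 512, 64, s_q=1)
intensity_sq2 = absorbed_decode_intensity(128, 512, 64, s_q=2)
h100_ridge = ridge_point(989e12, 3.35e12)  # approximate, assumption
h20_ridge = ridge_point(148e12, 4.0e12)    # approximate, assumption

print(f"cache ratio: {float(cache_ratio):.5%}")
print(f"absorbed intensity: s_q=1 {intensity_sq1:.1f}, s_q=2 {intensity_sq2:.1f}")
print(f"ridge points: H100 {h100_ridge:.1f}, H20 {h20_ridge:.1f}")
```

Under these assumptions the ratio is 576/2048 = 9/32, which equals the stated 28.125%. The absorbed path at s_q = 1 lands slightly below the assumed H100 ridge point and far above the assumed H20 ridge point. That is consistent with the claim that the absorbed form suits H100-class hardware but is compute-bound on H20, where doubling s_q for MTP doubles the compute per cache read.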