GQLA: Group-Query Latent Attention for Hardware-Adaptive Large Language Model Decoding
May 14, 2026
Author: Fanxu Meng
cs.AI
Abstract
Multi-head Latent Attention (MLA), the attention mechanism used in DeepSeek-V2/V3, jointly compresses keys and values into a low-rank latent and matches the H100 roofline almost perfectly. Its trained weights, however, expose only one decoding path, an absorbed MQA form. This ties efficient inference to H100-class compute-to-bandwidth ratios, forfeits tensor parallelism along the head axis, and yields no Multi-Token Prediction (MTP) gain on commodity inference GPUs such as the export-restricted H20. We propose Group-Query Latent Attention (GQLA), a minimal modification of MLA whose trained weights expose two algebraically equivalent decoding paths over the same parameters: an MQA-absorb path identical to MLA's, and a GQA path with a per-group expanded cache. The runtime picks the path that matches the target hardware, with no retraining and no custom kernels, so a single set of GQLA weights pins the rooflines of both H100 (MQA-absorb, s_q=1) and H20 (GQA + MTP, s_q=2), while supporting up to 8-way zero-redundancy tensor parallelism on the GQA path. To avoid pretraining from scratch, we extend TransMLA into TransGQLA, which converts a pretrained GQA checkpoint into a GQLA model; on LLaMA-3-8B it compresses the per-token KV cache to 28.125% of the GQA baseline on the MQA-absorb path while structurally preserving GQA-level traffic on the per-group path.
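Two claims in the abstract can be checked with a few lines of arithmetic. The sketch below is illustrative only and is not the paper's implementation. It first reproduces the 28.125% figure under the assumption that the per-token latent is 576 elements wide (512 latent dimensions plus 64 decoupled-RoPE dimensions, the widths used by DeepSeek-V2's MLA; the abstract does not state the width TransGQLA uses). It then shows, for a single group with a hypothetical key up-projection `w_up`, why scoring against expanded keys and scoring against the latent with an absorbed query give the same attention logits. RoPE, softmax, and the value path are left out.

```python
import random

# LLaMA-3-8B attention shape, taken from the public model configuration.
NUM_KV_HEADS = 8
HEAD_DIM = 128

# ASSUMPTION: 576 = 512 latent + 64 decoupled-RoPE dims (DeepSeek-V2 MLA widths).
# The abstract does not state the latent width used by TransGQLA.
LATENT_DIM = 576


def kv_cache_ratio(latent_dim, num_kv_heads, head_dim):
    """Per-token, per-layer latent cache size relative to a GQA KV cache."""
    gqa_elems = 2 * num_kv_heads * head_dim  # keys plus values
    return latent_dim / gqa_elems


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def score_paths(seed=0, tokens=5, rank=4, head_dim=3):
    """Return attention logits from the expanded path and the absorbed path."""
    rng = random.Random(seed)

    def rand(rows, cols):
        return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]

    latent = rand(tokens, rank)   # cached latent, one row per past token
    w_up = rand(rank, head_dim)   # hypothetical key up-projection for one group
    query = rand(1, head_dim)     # query for one decode step

    # Expanded path: materialize the keys, then score against them.
    keys = matmul(latent, w_up)                       # tokens x head_dim
    expanded = matmul(query, transpose(keys))[0]

    # Absorbed path: fold the up-projection into the query,
    # then score directly against the latent cache.
    absorbed_query = matmul(query, transpose(w_up))   # 1 x rank
    absorbed = matmul(absorbed_query, transpose(latent))[0]
    return expanded, absorbed


if __name__ == "__main__":
    print(kv_cache_ratio(LATENT_DIM, NUM_KV_HEADS, HEAD_DIM))
    expanded, absorbed = score_paths()
    print(max(abs(a - b) for a, b in zip(expanded, absorbed)))
```

The two paths differ only in what is read from memory: the expanded path reads keys of width `head_dim` per group, while the absorbed path reads one shared latent of width `rank` and spends extra compute on the wider query. That trade-off is what lets a runtime choose a path according to the target GPU's compute-to-bandwidth ratio.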