GQLA: Group-Query Latent Attention for Hardware-Adaptive Decoding of Large Language Models

Abstract

Multi-head Latent Attention (MLA), the attention mechanism used in DeepSeek-V2/V3, jointly compresses keys and values into a low-rank latent representation and matches the H100 roofline almost perfectly. Its trained weights, however, expose only one decoding path, an absorbed MQA form. This ties efficient inference to H100-class compute-to-bandwidth ratios, rules out tensor parallelism along the head axis, and yields no Multi-Token Prediction (MTP) gain on commodity inference GPUs such as the export-restricted H20. We propose Group-Query Latent Attention (GQLA), a minimal modification of MLA whose trained weights expose two algebraically equivalent decoding paths over the same parameters: an MQA-absorb path identical to MLA's, and a GQA path with a per-group expanded cache. The runtime picks the path that matches the target hardware, with no retraining and no custom kernels, so a single set of GQLA weights pins the rooflines of both H100 (MQA-absorb, s_q=1) and H20 (GQA + MTP, s_q=2), while supporting up to 8-way zero-redundancy tensor parallelism on the GQA path. To avoid pretraining from scratch, we extend TransMLA into TransGQLA, which converts a pretrained GQA checkpoint into a GQLA model; on LLaMA-3-8B it compresses the per-token KV cache to 28.125% of the GQA baseline on the MQA-absorb path while structurally preserving GQA-level traffic on the per-group path.
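The 28.125% figure can be sanity-checked with a short calculation. The sketch below counts cached elements per token per layer, using the public LLaMA-3-8B attention shape (8 KV heads, head dimension 128). The latent split of 512 compressed dimensions plus 64 decoupled RoPE dimensions is an assumption borrowed from DeepSeek-V2's MLA configuration; the abstract does not state it, but it reproduces the quoted ratio exactly. The function and constant names are illustrative only.

```python
def kv_elems_gqa(n_kv_heads: int, head_dim: int) -> int:
    """Cached elements per token per layer under GQA: one key and one value per KV head."""
    return 2 * n_kv_heads * head_dim


def kv_elems_latent(latent_dim: int, rope_dim: int) -> int:
    """Cached elements per token per layer on an absorbed latent path:
    one shared compressed latent plus one shared RoPE key."""
    return latent_dim + rope_dim


# LLaMA-3-8B attention shape (public model configuration).
N_KV_HEADS = 8
HEAD_DIM = 128

# ASSUMPTION: latent/RoPE split as in DeepSeek-V2's MLA; not stated in the abstract.
LATENT_DIM = 512
ROPE_DIM = 64

gqa = kv_elems_gqa(N_KV_HEADS, HEAD_DIM)
latent = kv_elems_latent(LATENT_DIM, ROPE_DIM)
ratio = latent / gqa

print(f"GQA cache:    {gqa} elements per token per layer")
print(f"Latent cache: {latent} elements per token per layer")
print(f"Ratio:        {ratio:.3%}")
```

Under these assumptions the GQA baseline stores 2048 elements and the latent path stores 576, giving 576 / 2048 = 28.125%. The ratio is independent of layer count and element width as long as both paths use the same precision.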