GQLA: Group-Query Latent Attention for Hardware-Adaptive Decoding of Large Language Models

Abstract

Multi-head Latent Attention (MLA), the attention mechanism used in DeepSeek-V2/V3, jointly compresses keys and values into a low-rank latent and matches the H100 roofline almost exactly. Its trained weights, however, expose only one decoding path, an absorbed MQA form. This ties efficient inference to H100-class compute-to-bandwidth ratios, rules out tensor parallelism along the head axis, and yields no Multi-Token Prediction (MTP) gain on commodity inference GPUs such as the export-restricted H20. We propose Group-Query Latent Attention (GQLA), a minimal modification of MLA whose trained weights expose two algebraically equivalent decoding paths over the same parameters: an MQA-absorb path identical to MLA's, and a GQA path with a per-group expanded cache. The runtime picks the path that matches the target hardware, with no retraining and no custom kernels, so a single set of GQLA weights hits the rooflines of both the H100 (MQA-absorb, s_q=1) and the H20 (GQA + MTP, s_q=2), while supporting up to 8-way zero-redundancy tensor parallelism on the GQA path. To avoid pretraining from scratch, we extend TransMLA into TransGQLA, which converts a pretrained GQA checkpoint into a GQLA model. On LLaMA-3-8B, it compresses the per-token KV cache to 28.125% of the GQA baseline on the MQA-absorb path while structurally preserving GQA-level traffic on the per-group path.
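
As a rough sanity check on the 28.125% figure, the arithmetic can be sketched as below. The abstract does not state the latent dimensions, so the values used here are assumptions: the LLaMA-3-8B shape (8 KV heads, head dimension 128) is taken from its public configuration, and the latent layout (a 512-dimensional joint KV latent plus a 64-dimensional shared RoPE key) is borrowed from DeepSeek-V2-style MLA. Under those assumptions the ratio works out to exactly 9/32.

```python
from fractions import Fraction

# ASSUMPTION: LLaMA-3-8B attention shape from its public config
# (not stated in the abstract).
NUM_KV_HEADS = 8
HEAD_DIM = 128

# ASSUMPTION: latent layout borrowed from DeepSeek-V2-style MLA;
# the abstract does not give these numbers.
LATENT_DIM = 512  # joint low-rank KV latent
ROPE_DIM = 64     # decoupled RoPE key, shared across heads


def gqa_cache_elems(num_kv_heads: int, head_dim: int) -> int:
    """Per-token, per-layer cache elements for GQA: keys plus values."""
    return 2 * num_kv_heads * head_dim


def latent_cache_elems(latent_dim: int, rope_dim: int) -> int:
    """Per-token, per-layer cache elements on the MQA-absorb path."""
    return latent_dim + rope_dim


baseline = gqa_cache_elems(NUM_KV_HEADS, HEAD_DIM)
latent = latent_cache_elems(LATENT_DIM, ROPE_DIM)
ratio = Fraction(latent, baseline)

print(f"GQA baseline: {baseline} elements per token per layer")
print(f"Latent cache: {latent} elements per token per layer")
print(f"Ratio: {ratio} = {float(ratio) * 100}%")
```

The element size (for example 2 bytes for bf16) cancels in the ratio, so the check is independent of the cache dtype as long as both paths store the same precision.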