TPLA: Tensor Parallel Latent Attention for Efficient Disaggregated Prefill & Decode Inference

Abstract

Multi-Head Latent Attention (MLA), introduced in DeepSeek-V2, compresses key-value states into a low-rank latent vector and caches only this vector to reduce memory. Under tensor parallelism (TP), however, attention heads are computed across multiple devices, and each device must load the full cache, which erodes the advantage of MLA over Grouped Query Attention (GQA). We propose Tensor-Parallel Latent Attention (TPLA): a scheme that partitions both the latent representation and each head's input dimension across devices, performs attention independently per shard, and then combines the results with an all-reduce. TPLA preserves the benefits of a compressed KV cache while retaining the efficiency of TP. Unlike Grouped Latent Attention (GLA), every head in TPLA still uses the full latent representation, which maintains stronger representational capacity. TPLA is drop-in compatible with models pre-trained using MLA: it supports MLA-style prefilling and enables efficient tensor-parallel decoding without retraining. Applying simple orthogonal transforms, such as the Hadamard transform or PCA, before TP slicing further mitigates cross-shard interference, yielding minimal accuracy degradation. By reducing the per-device KV cache for DeepSeek-V3 and Kimi-K2, we achieve 1.79x and 1.93x speedups, respectively, at a 32K-token context length while maintaining performance on commonsense and LongBench benchmarks. TPLA can be implemented with FlashAttention-3, enabling practical end-to-end acceleration.
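The slicing idea in the abstract can be illustrated with a toy, single-query sketch in plain Python. It is not the authors' implementation, and everything in it is an assumption made for illustration: the tiny dimensions, the use of the cached latents as both keys and values, the shared softmax scale on every shard, the random output projection `W_o`, and all helper names. It ignores RoPE, multiple heads, and the absorbed MLA projections. It shows that an orthonormal Hadamard rotation leaves the full attention scores unchanged, that per-shard partial scores sum to the full score, and how per-shard attention outputs might be combined by a sum that stands in for the all-reduce.

```python
import math
import random

def hadamard(n):
    """Orthonormal n x n Hadamard matrix (Sylvester construction, n a power of 2)."""
    H = [[1.0]]
    while len(H) < n:
        H = [row + row for row in H] + [row + [-x for x in row] for row in H]
    s = 1.0 / math.sqrt(n)
    return [[s * x for x in row] for row in H]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def matvec(M, v):
    return [dot(row, v) for row in M]

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    z = sum(es)
    return [e / z for e in es]

def attend(q, cache, scale):
    """Single-query attention; cached latents act as both keys and values (assumption)."""
    w = softmax([scale * dot(q, c) for c in cache])
    return [sum(wi * c[j] for wi, c in zip(w, cache)) for j in range(len(cache[0]))]

random.seed(0)
d_latent, n_tokens, n_shards, d_out = 8, 5, 2, 3   # toy sizes, hypothetical
shard = d_latent // n_shards
scale = 1.0 / math.sqrt(d_latent)                  # same scale on every shard (assumption)

q = [random.gauss(0, 1) for _ in range(d_latent)]
cache = [[random.gauss(0, 1) for _ in range(d_latent)] for _ in range(n_tokens)]
W_o = [[random.gauss(0, 1) for _ in range(d_latent)] for _ in range(d_out)]

# Rotate query and cache with the same orthonormal matrix before slicing.
H = hadamard(d_latent)
q_rot = matvec(H, q)
cache_rot = [matvec(H, c) for c in cache]

full_scores = [dot(q, c) for c in cache]
rot_scores = [dot(q_rot, c) for c in cache_rot]    # equal to full_scores up to rounding

def slice_of(v, r):
    """The part of a latent vector held by simulated device r."""
    return v[r * shard:(r + 1) * shard]

# Partial scores per shard add up to the full score.
partial = [[dot(slice_of(q_rot, r), slice_of(c, r)) for c in cache_rot]
           for r in range(n_shards)]
summed_scores = [sum(p[t] for p in partial) for t in range(n_tokens)]

# Each simulated device runs its own softmax on its slice, applies its
# columns of W_o, and the partial outputs are summed (stand-in for all-reduce).
shard_outputs = [attend(slice_of(q_rot, r), [slice_of(c, r) for c in cache_rot], scale)
                 for r in range(n_shards)]
y_sharded = [sum(dot(slice_of(W_o[i], r), shard_outputs[r]) for r in range(n_shards))
             for i in range(d_out)]

# Reference: unsharded attention in the rotated space.
y_full = matvec(W_o, attend(q_rot, cache_rot, scale))
gap = max(abs(a - b) for a, b in zip(y_sharded, y_full))
print("max |sharded - full| =", gap)
```

The scores are preserved exactly by the rotation, but the sharded output generally differs from the unsharded one because each shard normalizes its own softmax. The printed gap is that approximation error for this random toy input, and says nothing about the accuracy reported in the paper.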
