Multi-Head Low-Rank Attention

Abstract

Long-context inference in large language models is bottlenecked by Key-Value (KV) cache loading during the decoding stage: because generation is sequential, the KV cache must be transferred from off-chip High-Bandwidth Memory (HBM) to on-chip Static Random-Access Memory (SRAM) at every step. Multi-Head Latent Attention (MLA) significantly reduces the total KV cache size, but it suffers from a sharding bottleneck during distributed decoding with Tensor Parallelism (TP). Because its single latent head cannot be partitioned, each device is forced to redundantly load the complete KV cache for every token, which incurs excessive memory traffic and diminishes TP benefits such as weight sharding. In this work, we propose Multi-Head Low-Rank Attention (MLRA), which makes the latent states partitionable and thereby enables efficient 4-way TP decoding. Extensive experiments show that MLRA achieves state-of-the-art perplexity and downstream task performance while delivering a 2.8× decoding speedup over MLA. Code is available at https://github.com/SongtaoLiu0823/MLRA. Pretrained weights, along with the training and evaluation data, are available at https://huggingface.co/Soughing/MLRA.
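
The sharding bottleneck described above can be illustrated with a back-of-the-envelope calculation. The sketch below is not taken from the paper: the function name `per_device_kv_bytes`, the sequence length, the latent dimension, and the element size are all hypothetical values chosen for illustration. It only compares how many KV cache bytes a single device would read per layer per decoding step when the latent state is replicated on every device versus split evenly across them.

```python
def per_device_kv_bytes(seq_len, latent_dim, bytes_per_elem, tp_degree, partitionable):
    """Bytes of KV cache one device reads per layer per decoding step.

    Hypothetical helper for illustration; it models memory traffic only.
    """
    total = seq_len * latent_dim * bytes_per_elem
    if partitionable:
        # Assumption: the latent state splits evenly across TP devices.
        return total // tp_degree
    # A single indivisible latent head: every device reads the full cache.
    return total


# Illustrative values (assumptions, not figures from the paper).
SEQ_LEN = 131_072      # tokens already in the cache
LATENT_DIM = 512       # elements per token in the latent state
BYTES_PER_ELEM = 2     # e.g. a 16-bit floating-point format
TP_DEGREE = 4          # 4-way tensor parallelism, as in the abstract

replicated = per_device_kv_bytes(
    SEQ_LEN, LATENT_DIM, BYTES_PER_ELEM, TP_DEGREE, partitionable=False
)
sharded = per_device_kv_bytes(
    SEQ_LEN, LATENT_DIM, BYTES_PER_ELEM, TP_DEGREE, partitionable=True
)

print(f"replicated latent head: {replicated} bytes per device")
print(f"partitioned latent:     {sharded} bytes per device")
print(f"traffic ratio:          {replicated / sharded:.1f}x")
```

Under these assumptions the per-device read volume drops by the TP degree. This ratio is a traffic estimate only; the 2.8× speedup reported above is a measured result and does not follow from this arithmetic alone.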