TransMLA: Multi-head Latent Attention Is All You Need
February 11, 2025
Authors: Fanxu Meng, Zengwei Yao, Muhan Zhang
cs.AI
Abstract
Modern large language models (LLMs) often encounter communication bottlenecks
on current hardware, rather than purely computational constraints. Multi-head
Latent Attention (MLA) tackles this challenge by using low-rank matrices in the
key-value (KV) layers, thereby allowing compressed latent KV states to be
cached. This approach significantly reduces the KV cache size relative to
traditional multi-head attention, leading to faster inference. Moreover, MLA
employs an up-projection matrix to increase expressiveness, trading additional
computation for reduced communication overhead. Although MLA has demonstrated
efficiency and effectiveness in Deepseek V2/V3/R1, many major model providers
still rely on Group Query Attention (GQA) and have not announced any plans to
adopt MLA. In this paper, we show that GQA can always be represented by MLA
while maintaining the same KV cache overhead, but the converse does not hold.
To encourage broader use of MLA, we introduce **TransMLA**, a post-training
method that converts widely used GQA-based pre-trained models (e.g., LLaMA,
Qwen, Mixtral) into MLA-based models. After conversion, the model can undergo
additional training to boost expressiveness without increasing the KV cache
size. Furthermore, we plan to develop MLA-specific inference acceleration
techniques to preserve low latency in transformed models, thus enabling more
efficient distillation of Deepseek R1.
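To make the central claim concrete, here is a minimal numerical sketch (not the authors' code, and with illustrative toy dimensions rather than values from any released model) of how a GQA key projection, whose shared KV heads are replicated across query heads, can be rewritten exactly in MLA's low-rank form: a cached latent the same size as the GQA KV cache, followed by an up-projection. PyTorch is assumed.

```python
import torch

# Toy dimensions (assumed for illustration only).
d_model, n_q_heads, n_kv_heads, d_head = 64, 8, 2, 16
group = n_q_heads // n_kv_heads          # query heads sharing each KV head

W_k_gqa = torch.randn(d_model, n_kv_heads * d_head)   # GQA key projection
x = torch.randn(5, d_model)                            # 5 token hidden states

# GQA: cache the small keys, then repeat each KV head for its query group.
k_cached = x @ W_k_gqa                                 # (5, 32): what GQA caches
k_gqa = (k_cached.view(5, n_kv_heads, d_head)
                 .repeat_interleave(group, dim=1)
                 .reshape(5, n_q_heads * d_head))      # keys seen by all 8 query heads

# MLA view: cache a latent of the same size, then apply an up-projection.
# Here the up-projection is simply the replication written as a matrix; after a
# TransMLA-style conversion it can be fine-tuned as a richer mapping without
# growing the cached latent.
W_up = torch.zeros(n_kv_heads * d_head, n_q_heads * d_head)
for h in range(n_q_heads):
    src = h // group                                   # KV head feeding query head h
    W_up[src * d_head:(src + 1) * d_head, h * d_head:(h + 1) * d_head] = torch.eye(d_head)

latent = x @ W_k_gqa                                   # cached latent KV state (same cache size)
k_mla = latent @ W_up                                  # up-projected keys
print(torch.allclose(k_gqa, k_mla))                    # True: GQA expressed in MLA form
```

The sketch only covers the equivalence direction stated in the abstract (GQA representable as MLA at equal cache cost); the converse fails because a trained up-projection need not reduce to a head-replication matrix, which is where MLA's extra expressiveness comes from.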