TransMLA: Multi-head Latent Attention Is All You Need
February 11, 2025
Authors: Fanxu Meng, Zengwei Yao, Muhan Zhang
cs.AI
Abstract
Modern large language models (LLMs) often encounter communication bottlenecks
on current hardware, rather than purely computational constraints. Multi-head
Latent Attention (MLA) tackles this challenge by using low-rank matrices in the
key-value (KV) layers, thereby allowing compressed latent KV states to be
cached. This approach significantly reduces the KV cache size relative to
traditional multi-head attention, leading to faster inference. Moreover, MLA
employs an up-projection matrix to increase expressiveness, trading additional
computation for reduced communication overhead. Although MLA has demonstrated
efficiency and effectiveness in Deepseek V2/V3/R1, many major model providers
still rely on Group Query Attention (GQA) and have not announced any plans to
adopt MLA. In this paper, we show that GQA can always be represented by MLA
while maintaining the same KV cache overhead, but the converse does not hold.
To encourage broader use of MLA, we introduce **TransMLA**, a post-training
method that converts widely used GQA-based pre-trained models (e.g., LLaMA,
Qwen, Mixtral) into MLA-based models. After conversion, the model can undergo
additional training to boost expressiveness without increasing the KV cache
size. Furthermore, we plan to develop MLA-specific inference acceleration
techniques to preserve low latency in transformed models, thus enabling more
efficient distillation of Deepseek R1.
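To make the central claim concrete, here is a minimal numerical sketch (not the authors' code, and with illustrative toy dimensions rather than values from any released model) of how a GQA key projection, whose shared KV heads are replicated across query heads, can be rewritten exactly in MLA's low-rank form: a cached latent the same size as the GQA KV cache, followed by an up-projection. PyTorch is assumed.

```python
import torch

# Toy dimensions (assumed for illustration only).
d_model, n_q_heads, n_kv_heads, d_head = 64, 8, 2, 16
group = n_q_heads // n_kv_heads          # query heads sharing each KV head

W_k_gqa = torch.randn(d_model, n_kv_heads * d_head)   # GQA key projection
x = torch.randn(5, d_model)                            # 5 token hidden states

# GQA: cache the small keys, then repeat each KV head for its query group.
k_cached = x @ W_k_gqa                                 # (5, 32): what GQA caches
k_gqa = (k_cached.view(5, n_kv_heads, d_head)
                 .repeat_interleave(group, dim=1)
                 .reshape(5, n_q_heads * d_head))      # keys seen by all 8 query heads

# MLA view: cache a latent of the same size, then apply an up-projection.
# Here the up-projection is simply the replication written as a matrix; after a
# TransMLA-style conversion it can be fine-tuned as a richer mapping without
# growing the cached latent.
W_up = torch.zeros(n_kv_heads * d_head, n_q_heads * d_head)
for h in range(n_q_heads):
    src = h // group                                   # KV head feeding query head h
    W_up[src * d_head:(src + 1) * d_head, h * d_head:(h + 1) * d_head] = torch.eye(d_head)

latent = x @ W_k_gqa                                   # cached latent KV state (same cache size)
k_mla = latent @ W_up                                  # up-projected keys
print(torch.allclose(k_gqa, k_mla))                    # True: GQA expressed in MLA form
```

The sketch only covers the equivalence direction stated in the abstract (GQA representable as MLA at equal cache cost); the converse fails because a trained up-projection need not reduce to a head-replication matrix, which is where MLA's extra expressiveness comes from.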