Kimi Linear: An Expressive, Efficient Attention Architecture
October 30, 2025
作者: Kimi Team, Yu Zhang, Zongyu Lin, Xingcheng Yao, Jiaxi Hu, Fanqing Meng, Chengyin Liu, Xin Men, Songlin Yang, Zhiyuan Li, Wentao Li, Enzhe Lu, Weizhou Liu, Yanru Chen, Weixin Xu, Longhui Yu, Yejie Wang, Yu Fan, Longguang Zhong, Enming Yuan, Dehao Zhang, Yizhi Zhang, T. Y. Liu, Haiming Wang, Shengjun Fang, Weiran He, Shaowei Liu, Yiwei Li, Jianlin Su, Jiezhong Qiu, Bo Pang, Junjie Yan, Zhejun Jiang, Weixiao Huang, Bohong Yin, Jiacheng You, Chu Wei, Zhengtao Wang, Chao Hong, Yutian Chen, Guanduo Chen, Yucheng Wang, Huabin Zheng, Feng Wang, Yibo Liu, Mengnan Dong, Zheng Zhang, Siyuan Pan, Wenhao Wu, Yuhao Wu, Longyu Guan, Jiawen Tao, Guohong Fu, Xinran Xu, Yuzhi Wang, Guokun Lai, Yuxin Wu, Xinyu Zhou, Zhilin Yang, Yulun Du
cs.AI
Abstract
We introduce Kimi Linear, a hybrid linear attention architecture that, for
the first time, outperforms full attention under fair comparisons across
various scenarios -- including short-context, long-context, and reinforcement
learning (RL) scaling regimes. At its core lies Kimi Delta Attention (KDA), an
expressive linear attention module that extends Gated DeltaNet with a
finer-grained gating mechanism, enabling more effective use of limited
finite-state RNN memory. Our bespoke chunkwise algorithm achieves high hardware
efficiency through a specialized variant of Diagonal-Plus-Low-Rank (DPLR)
transition matrices, which substantially reduces computation compared to the
general DPLR formulation while remaining more consistent with the classical
delta rule.
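To make the gating distinction concrete, the snippet below is a minimal, non-authoritative sketch of a delta-rule recurrence with per-channel (fine-grained) forget gates, in the spirit of KDA as described above. It is a naive reference loop, not the paper's hardware-efficient chunkwise DPLR kernel; the tensor names, shapes, and the exact placement of the gate are illustrative assumptions.

```python
import torch

def kda_recurrence_sketch(q, k, v, alpha, beta):
    """Naive recurrent sketch of a delta-rule update with per-channel gating.

    Shapes (single head, batch omitted for clarity):
      q, k:  (T, d_k)  queries / keys
      v:     (T, d_v)  values
      alpha: (T, d_k)  per-channel forget gates in (0, 1) -- finer-grained than a
                       scalar-per-head gate as in Gated DeltaNet (assumption)
      beta:  (T,)      delta-rule write strengths in (0, 1)
    """
    T, d_k = k.shape
    d_v = v.shape[-1]
    S = torch.zeros(d_k, d_v)          # fixed-size finite-state RNN memory
    outputs = []
    for t in range(T):
        a, b = alpha[t], beta[t]
        # Diagonal (per-channel) decay of the state, then a rank-1 delta-rule
        # correction that moves the memory addressed by k_t toward v_t.
        S = a.unsqueeze(-1) * S        # Diag(alpha_t) @ S
        pred = k[t] @ S                # current readout for key k_t
        S = S + b * torch.outer(k[t], v[t] - pred)
        outputs.append(q[t] @ S)       # query the updated state
    return torch.stack(outputs)        # (T, d_v)
```

The chunkwise algorithm in the paper computes the same kind of recurrence blockwise with specialized DPLR transition matrices for hardware efficiency; this loop only illustrates the state update that the finer-grained gate acts on.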
We pretrain a Kimi Linear model with 3B activated parameters and 48B total
parameters, based on a layerwise hybrid of KDA and Multi-Head Latent Attention
(MLA). Our experiments show that with an identical training recipe, Kimi Linear
outperforms full MLA by a sizeable margin across all evaluated tasks, while
reducing KV cache usage by up to 75% and achieving up to 6x the decoding
throughput for a 1M context. These results demonstrate that Kimi Linear can
serve as a drop-in replacement for full attention architectures, offering
superior performance and efficiency, including on tasks with longer input and
output lengths.
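As a rough illustration of where an "up to 75%" KV-cache saving can come from, the sketch below assumes a hypothetical layerwise interleaving in which only one of every four layers is MLA (and therefore stores a length-dependent KV cache), while KDA layers keep a fixed-size recurrent state. The 3:1 ratio, layer count, and per-token byte cost are illustrative assumptions, not figures taken from this abstract.

```python
# Back-of-the-envelope KV-cache arithmetic for a layerwise KDA/MLA hybrid.
# ASSUMPTIONS (illustrative only): 32 layers, one MLA layer in every four,
# and 1 KiB of KV cache per token per MLA layer.

def kv_cache_bytes(num_layers: int, mla_every: int, seq_len: int,
                   bytes_per_token_per_layer: int) -> int:
    """KV-cache footprint when only every `mla_every`-th layer stores a KV cache."""
    mla_layers = num_layers // mla_every
    return mla_layers * seq_len * bytes_per_token_per_layer

full_mla = kv_cache_bytes(32, 1, 1_000_000, 1024)   # all-attention baseline
hybrid   = kv_cache_bytes(32, 4, 1_000_000, 1024)   # hypothetical 3:1 hybrid
print(f"KV cache reduction: {100 * (1 - hybrid / full_mla):.0f}%")  # -> 75%
```

Under these assumptions the cache shrinks in proportion to the fraction of layers that remain full attention, which is also why decoding throughput at very long contexts improves: the per-step memory traffic no longer grows with sequence length for the KDA layers.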
To support further research, we open-source the KDA kernel and vLLM
implementations, and release the pre-trained and instruction-tuned model
checkpoints.