Kimi Linear: An Expressive, Efficient Attention Architecture
October 30, 2025
Authors: Kimi Team, Yu Zhang, Zongyu Lin, Xingcheng Yao, Jiaxi Hu, Fanqing Meng, Chengyin Liu, Xin Men, Songlin Yang, Zhiyuan Li, Wentao Li, Enzhe Lu, Weizhou Liu, Yanru Chen, Weixin Xu, Longhui Yu, Yejie Wang, Yu Fan, Longguang Zhong, Enming Yuan, Dehao Zhang, Yizhi Zhang, T. Y. Liu, Haiming Wang, Shengjun Fang, Weiran He, Shaowei Liu, Yiwei Li, Jianlin Su, Jiezhong Qiu, Bo Pang, Junjie Yan, Zhejun Jiang, Weixiao Huang, Bohong Yin, Jiacheng You, Chu Wei, Zhengtao Wang, Chao Hong, Yutian Chen, Guanduo Chen, Yucheng Wang, Huabin Zheng, Feng Wang, Yibo Liu, Mengnan Dong, Zheng Zhang, Siyuan Pan, Wenhao Wu, Yuhao Wu, Longyu Guan, Jiawen Tao, Guohong Fu, Xinran Xu, Yuzhi Wang, Guokun Lai, Yuxin Wu, Xinyu Zhou, Zhilin Yang, Yulun Du
cs.AI
Abstract
We introduce Kimi Linear, a hybrid linear attention architecture that, for
the first time, outperforms full attention under fair comparisons across
various scenarios -- including short-context, long-context, and reinforcement
learning (RL) scaling regimes. At its core lies Kimi Delta Attention (KDA), an
expressive linear attention module that extends Gated DeltaNet with a
finer-grained gating mechanism, enabling more effective use of limited
finite-state RNN memory. Our bespoke chunkwise algorithm achieves high hardware
efficiency through a specialized variant of the Diagonal-Plus-Low-Rank (DPLR)
transition matrices, which substantially reduces computation compared to the
general DPLR formulation while remaining more consistent with the classical
delta rule.
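As a schematic sketch for intuition (a simplified form; the exact gate placement and normalization used in KDA may differ from this), the delta-rule family maintains a matrix state S_t with output o_t = S_t^\top q_t. Gated DeltaNet decays S_{t-1} with a single scalar gate per step, while a finer-grained, KDA-style gate replaces that scalar with a per-channel diagonal matrix, so the per-step transition becomes diagonal plus rank-one, i.e., the specialized DPLR structure referenced above:

    DeltaNet:           S_t = (I - \beta_t k_t k_t^\top)\, S_{t-1} + \beta_t k_t v_t^\top
    Gated DeltaNet:     S_t = \alpha_t (I - \beta_t k_t k_t^\top)\, S_{t-1} + \beta_t k_t v_t^\top,            \alpha_t \in (0, 1)
    Fine-grained gate:  S_t = (I - \beta_t k_t k_t^\top)\, \mathrm{Diag}(a_t)\, S_{t-1} + \beta_t k_t v_t^\top,  a_t \in (0, 1)^{d_k}

Here \beta_t is the per-token learning rate of the delta rule and a_t is a channel-wise forget gate; the scalar-gate case is recovered when all entries of a_t are equal.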
We pretrain a Kimi Linear model with 3B activated parameters and 48B total
parameters, based on a layerwise hybrid of KDA and Multi-Head Latent Attention
(MLA). Our experiments show that with an identical training recipe, Kimi Linear
outperforms full MLA by a sizeable margin across all evaluated tasks, while
reducing KV cache usage by up to 75% and achieving up to 6x the decoding
throughput at a 1M context length. These results demonstrate that Kimi Linear
can serve as a drop-in replacement for full attention architectures, offering
superior performance and efficiency, including on tasks with longer input and
output lengths.
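As a rough illustration of where a KV-cache saving of this size can come from (the 3:1 layer ratio below is an assumption made for the sketch, not a figure quoted in this abstract): KDA layers keep a constant-size recurrent state rather than a per-token KV cache, so if three of every four attention layers are KDA and one is MLA, only a quarter of the full-MLA cache remains.

    # Minimal sketch, assuming a hypothetical 3:1 interleaving of KDA and MLA layers.
    # KDA layers hold a fixed-size recurrent state (no per-token KV cache);
    # MLA layers keep a per-token latent KV cache as in a full-attention stack.
    def kv_cache_fraction(kda_per_block: int = 3, mla_per_block: int = 1) -> float:
        """Fraction of a full-MLA KV cache retained by the layerwise hybrid."""
        return mla_per_block / (kda_per_block + mla_per_block)

    frac = kv_cache_fraction()
    print(f"retained: {frac:.0%}, reduction: {1 - frac:.0%}")  # retained: 25%, reduction: 75%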
To support further research, we open-source the KDA kernel and vLLM
implementations, and release the pre-trained and instruction-tuned model
checkpoints.
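For example, one plausible way to try a released checkpoint with the open-sourced vLLM support is vLLM's offline generation API; the model identifier below is an assumed placeholder, not a name confirmed by this abstract, and should be replaced with the actual released repository id.

    # Hedged usage sketch with vLLM's offline API; the checkpoint id is an assumption.
    from vllm import LLM, SamplingParams

    llm = LLM(model="moonshotai/Kimi-Linear-48B-A3B-Instruct")  # assumed repo id
    sampling = SamplingParams(temperature=0.6, max_tokens=256)
    outputs = llm.generate(["Summarize the delta rule in two sentences."], sampling)
    print(outputs[0].outputs[0].text)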