MiniCPM-SALA: Hybridizing Sparse and Linear Attention for Efficient Long-Context Modeling
February 12, 2026
Authors: MiniCPM Team, Wenhao An, Yingfa Chen, Yewei Fang, Jiayi Li, Xin Li, Yaohui Li, Yishan Li, Yuxuan Li, Biyuan Lin, Chuan Liu, Hezi Liu, Siyuan Liu, Hongya Lyu, Yinxu Pan, Shixin Ren, Xingyu Shen, Zhou Su, Haojun Sun, Yangang Sun, Zhen Leng Thai, Xin Tian, Rui Wang, Xiaorong Wang, Yudong Wang, Bo Wu, Xiaoyue Xu, Dong Xu, Shuaikang Xue, Jiawei Yang, Bowen Zhang, Jinqian Zhang, Letian Zhang, Shengnan Zhang, Xinyu Zhang, Xinyuan Zhang, Zhu Zhang, Hengyu Zhao, Jiacheng Zhao, Jie Zhou, Zihan Zhou, Shuo Wang, Chaojun Xiao, Xu Han, Zhiyuan Liu, Maosong Sun
cs.AI
Abstract
The evolution of large language models (LLMs) toward ultra-long-context applications faces challenges posed by the high computational and memory costs of the Transformer architecture. While existing sparse and linear attention mechanisms attempt to mitigate these issues, they typically involve a trade-off between memory efficiency and model performance. This paper introduces MiniCPM-SALA, a 9B-parameter hybrid architecture that integrates the high-fidelity long-context modeling of sparse attention (InfLLM-V2) with the global efficiency of linear attention (Lightning Attention). By employing a layer selection algorithm to interleave these mechanisms in a 1:3 ratio and utilizing a hybrid positional encoding (HyPE), the model balances efficiency and performance on long-context tasks. Furthermore, we introduce a cost-effective continual training framework that transforms pre-trained Transformer-based models into hybrid models, reducing training costs by approximately 75% compared to training from scratch. Extensive experiments show that MiniCPM-SALA maintains general capabilities comparable to those of full-attention models while offering substantially improved efficiency. On a single NVIDIA A6000D GPU, the model achieves up to 3.5x the inference speed of the full-attention baseline at a sequence length of 256K tokens and supports context lengths of up to 1M tokens, a scale at which traditional full-attention 8B models fail due to memory constraints.
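To make the 1:3 hybridization concrete, the sketch below shows one way a layer plan mixing sparse-attention and linear-attention layers could be laid out. This is a minimal illustration only: the abstract does not specify the layer selection algorithm, so a simple uniform interleaving is used as a placeholder, and the function name `plan_hybrid_layers` is hypothetical rather than part of the MiniCPM-SALA codebase.

```python
# Illustrative sketch of a 1:3 sparse-to-linear layer plan.
# Assumption: uniform interleaving stands in for the paper's actual
# layer selection algorithm, which is not described in the abstract.

from typing import List


def plan_hybrid_layers(num_layers: int, sparse_ratio: float = 0.25) -> List[str]:
    """Assign each decoder layer a sparse or linear attention type.

    sparse_ratio = 0.25 corresponds to the 1:3 sparse-to-linear mix
    mentioned in the abstract; the real model may place sparse layers
    non-uniformly based on its selection criterion.
    """
    num_sparse = max(1, round(num_layers * sparse_ratio))
    stride = num_layers / num_sparse
    sparse_ids = {int(i * stride) for i in range(num_sparse)}
    return [
        "sparse (InfLLM-V2)" if i in sparse_ids else "linear (Lightning Attention)"
        for i in range(num_layers)
    ]


if __name__ == "__main__":
    # Print a 32-layer plan: roughly one sparse layer for every three linear layers.
    for idx, kind in enumerate(plan_hybrid_layers(num_layers=32)):
        print(f"layer {idx:02d}: {kind}")
```

In such a design, the few sparse-attention layers retain precise retrieval over long contexts, while the linear-attention layers keep memory and compute roughly constant per token, which is consistent with the efficiency and capability trade-off the abstract describes.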