Every Attention Matters: An Efficient Hybrid Architecture for Long-Context Reasoning
October 22, 2025
Authors: Ling Team, Bin Han, Caizhi Tang, Chen Liang, Donghao Zhang, Fan Yuan, Feng Zhu, Jie Gao, Jingyu Hu, Longfei Li, Meng Li, Mingyang Zhang, Peijie Jiang, Peng Jiao, Qian Zhao, Qingyuan Yang, Wenbo Shen, Xinxing Yang, Yalin Zhang, Yankun Ren, Yao Zhao, Yibo Cao, Yixuan Sun, Yue Zhang, Yuchen Fang, Zibin Lin, Zixuan Cheng, Jun Zhou
cs.AI
Abstract
In this technical report, we present the Ring-linear model series,
specifically including Ring-mini-linear-2.0 and Ring-flash-linear-2.0.
Ring-mini-linear-2.0 comprises 16B total parameters with 957M activated, while
Ring-flash-linear-2.0 contains 104B total parameters with 6.1B activated. Both
models adopt a hybrid architecture that effectively integrates linear attention
and softmax attention, significantly reducing I/O and computational overhead in
long-context inference scenarios. Compared to a 32B dense model, this series
cuts inference cost to one-tenth, and relative to the original Ring series it
reduces cost by more than 50%. Furthermore, through
systematic exploration of the ratio between different attention mechanisms in
the hybrid architecture, we have identified the currently optimal model
structure. Additionally, by leveraging linghe, our self-developed
high-performance FP8 operator library, we improve overall training efficiency by 50%.
Benefiting from the high alignment between the training and inference engine
operators, the models can undergo long-term, stable, and highly efficient
optimization during the reinforcement learning phase, consistently maintaining
SOTA performance across multiple challenging complex reasoning benchmarks.
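The core idea of the hybrid architecture is to interleave O(n)-cost linear attention layers with a smaller share of full softmax attention layers, so that most of the long-context I/O and compute is handled by the linear layers. A minimal single-head NumPy sketch of that interleaving is shown below; the elu-based feature map and the 1:4 softmax-to-linear layer ratio are illustrative assumptions, not the paper's reported configuration:

```python
import numpy as np

def softmax_attention(q, k, v):
    # Standard scaled dot-product attention: O(n^2) in sequence length n.
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    w = np.exp(scores)
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v

def linear_attention(q, k, v, eps=1e-6):
    # Kernelized attention with feature map phi(x) = elu(x) + 1,
    # giving O(n) cost: keys/values are summarized into a (d, d_v) state.
    phi = lambda x: np.where(x > 0, x + 1.0, np.exp(x))
    qp, kp = phi(q), phi(k)
    kv = kp.T @ v                    # fixed-size summary, independent of n
    z = qp @ kp.sum(axis=0)          # per-query normalizer
    return (qp @ kv) / (z[:, None] + eps)

def hybrid_stack(x, n_layers=8, softmax_every=4):
    # Interleave one softmax-attention layer per `softmax_every` layers;
    # the remaining layers use linear attention (ratio chosen for illustration).
    for i in range(n_layers):
        attn = softmax_attention if (i + 1) % softmax_every == 0 else linear_attention
        x = x + attn(x, x, x)        # residual connection
    return x
```

Because the linear-attention layers compress the key/value history into a fixed-size state, their per-token cost does not grow with context length, which is where the long-context inference savings come from; the occasional softmax layers retain exact all-pairs interaction.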