Every Attention Matters: An Efficient Hybrid Architecture for Long-Context Reasoning
October 22, 2025
Authors: Ling Team, Bin Han, Caizhi Tang, Chen Liang, Donghao Zhang, Fan Yuan, Feng Zhu, Jie Gao, Jingyu Hu, Longfei Li, Meng Li, Mingyang Zhang, Peijie Jiang, Peng Jiao, Qian Zhao, Qingyuan Yang, Wenbo Shen, Xinxing Yang, Yalin Zhang, Yankun Ren, Yao Zhao, Yibo Cao, Yixuan Sun, Yue Zhang, Yuchen Fang, Zibin Lin, Zixuan Cheng, Jun Zhou
cs.AI
Abstract
In this technical report, we present the Ring-linear model series,
comprising Ring-mini-linear-2.0 and Ring-flash-linear-2.0.
Ring-mini-linear-2.0 has 16B total parameters with 957M activated, while
Ring-flash-linear-2.0 has 104B total parameters with 6.1B activated. Both
models adopt a hybrid architecture that effectively integrates linear attention
and softmax attention, significantly reducing I/O and computational overhead in
long-context inference scenarios. Compared with a 32B-parameter dense
model, this series reduces inference cost to 1/10, and compared with the original
Ring series, it reduces cost by more than 50%. Furthermore, through
systematic exploration of the ratio between different attention mechanisms in
the hybrid architecture, we have identified the currently optimal model
structure. Additionally, our self-developed high-performance FP8
operator library, linghe, improves overall training efficiency by 50%.
Because the operators of the training and inference engines are closely
aligned, the models can undergo long-term, stable, and highly efficient
optimization during the reinforcement learning phase, consistently maintaining
SOTA performance across multiple challenging complex reasoning benchmarks.
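
To illustrate why mixing linear attention with softmax attention can cut I/O in long-context inference, the following minimal sketch counts cached elements per sequence during decoding. It relies on two standard properties: a softmax attention layer caches one key and one value vector per token, while a linear attention layer keeps a fixed-size state matrix per head. The function names, layer count, head sizes, and the number of linear layers per softmax layer are hypothetical choices for illustration and are not taken from the report.

```python
# Illustrative sketch only. All configuration values used with these helpers
# (layer count, head counts, head_dim, linear_per_softmax) are hypothetical
# and are NOT the Ring-linear settings.

def softmax_cache_elems(seq_len, num_kv_heads, head_dim):
    """Per-layer KV cache: one key and one value vector per token per KV head."""
    return 2 * seq_len * num_kv_heads * head_dim


def linear_state_elems(num_heads, head_dim):
    """Per-layer linear-attention state: one head_dim x head_dim matrix per head.

    The size does not depend on the sequence length.
    """
    return num_heads * head_dim * head_dim


def hybrid_cache_elems(seq_len, num_layers, linear_per_softmax,
                       num_heads, num_kv_heads, head_dim):
    """Total cached elements when each group of (linear_per_softmax + 1)
    layers contains exactly one softmax layer; leftover layers are linear."""
    group = linear_per_softmax + 1
    softmax_layers = num_layers // group
    linear_layers = num_layers - softmax_layers
    return (softmax_layers * softmax_cache_elems(seq_len, num_kv_heads, head_dim)
            + linear_layers * linear_state_elems(num_heads, head_dim))


if __name__ == "__main__":
    # Hypothetical 32-layer model, 16 query heads, 4 KV heads, head_dim 128.
    for seq_len in (4096, 32768, 131072):
        full = hybrid_cache_elems(seq_len, 32, 0, 16, 4, 128)    # all softmax
        hybrid = hybrid_cache_elems(seq_len, 32, 7, 16, 4, 128)  # 7 linear : 1 softmax
        print(f"seq_len={seq_len}: full={full}, hybrid={hybrid}, "
              f"ratio={full / hybrid:.2f}")
```

Under these assumed settings, the softmax-only cache grows linearly with sequence length, while in the hybrid only the few softmax layers grow, so the gap widens as the context gets longer. The actual savings for the Ring-linear models depend on their real configuration, which is not given in this abstract.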