步步进化:面向万亿级思维模型的强化学习扩展
Every Step Evolves: Scaling Reinforcement Learning for Trillion-Scale Thinking Model
October 21, 2025
作者: Ling Team, Anqi Shen, Baihui Li, Bin Hu, Bin Jing, Cai Chen, Chao Huang, Chao Zhang, Chaokun Yang, Cheng Lin, Chengyao Wen, Congqi Li, Deng Zhao, Dingbo Yuan, Donghai You, Fagui Mao, Fanzhuang Meng, Feng Xu, Guojie Li, Guowei Wang, Hao Dai, Haonan Zheng, Hong Liu, Jia Guo, Jiaming Liu, Jian Liu, Jianhao Fu, Jiannan Shi, Jianwen Wang, Jianxin Lai, Jin Yang, Jun Mei, Jun Zhou, Junbo Zhao, Junping Zhao, Kuan Xu, Le Su, Lei Chen, Li Tang, Liang Jiang, Liangcheng Fu, Lianhao Xu, Linfeng Shi, Lisha Liao, Longfei Zheng, Meng Li, Mingchun Chen, Qi Zuo, Qiang Cheng, Qianggang Cao, Qitao Shi, Quanrui Guo, Senlin Zhu, Shaofei Wang, Shaomian Zheng, Shuaicheng Li, Shuwei Gu, Siba Chen, Tao Wu, Tao Zhang, Tianyu Zhang, Tianyu Zhou, Tiwei Bie, Tongkai Yang, Wang Hong, Wang Ren, Weihua Chen, Wenbo Yu, Wengang Zheng, Xiangchun Wang, Xiaodong Yan, Xiaopei Wan, Xin Zhao, Xinyu Kong, Xinyu Tang, Xudong Han, Xudong Wang, Xuemin Yang, Xueyu Hu, Yalin Zhang, Yan Sun, Yicheng Shan, Yilong Wang, Yingying Xu, Yongkang Liu, Yongzhen Guo, Yuanyuan Wang, Yuchen Yan, Yuefan Wang, Yuhong Guo, Zehuan Li, Zhankai Xu, Zhe Li, Zhenduo Zhang, Zhengke Gui, Zhenxuan Pan, Zhenyu Huang, Zhenzhong Lan, Zhiqiang Ding, Zhiqiang Zhang, Zhixun Li, Zhizhen Liu, Zihao Wang, Zujie Wen
cs.AI
摘要
我们推出Ring-1T,这是首个开源、具备万亿级参数规模的顶尖思维模型。该模型总计拥有1万亿参数,每个令牌激活约500亿参数。在万亿参数规模上训练此类模型带来了前所未有的挑战,包括训练与推理的不一致、rollout处理效率低下以及强化学习系统的瓶颈问题。为解决这些问题,我们开创了三大相互关联的创新技术:(1) IcePop通过令牌级差异掩码与裁剪稳定强化学习训练,解决了训练与推理不匹配导致的不稳定性;(2) C3PO++在令牌预算下动态划分长rollout,提升了资源利用率,实现了高时间效率;(3) ASystem,一个高性能强化学习框架,旨在克服阻碍万亿参数模型训练的系统瓶颈。Ring-1T在关键基准测试中取得了突破性成果:AIME-2025得分93.4,HMMT-2025得分86.72,CodeForces得分2088,ARC-AGI-v1得分55.94。尤为突出的是,它在IMO-2025上达到了银牌水平,彰显了其卓越的推理能力。通过向社区发布完整的1T参数MoE模型,我们为研究界提供了直接访问尖端推理能力的途径。这一贡献标志着大规模推理智能民主化的重要里程碑,并为开源模型性能设立了新基准。
English
We present Ring-1T, the first open-source, state-of-the-art thinking model
with a trillion-scale parameter. It features 1 trillion total parameters and
activates approximately 50 billion per token. Training such models at a
trillion-parameter scale introduces unprecedented challenges, including
train-inference misalignment, inefficiencies in rollout processing, and
bottlenecks in the RL system. To address these, we pioneer three interconnected
innovations: (1) IcePop stabilizes RL training via token-level discrepancy
masking and clipping, resolving instability from training-inference mismatches;
(2) C3PO++ improves resource utilization for long rollouts under a token budget
by dynamically partitioning them, thereby obtaining high time efficiency; and
(3) ASystem, a high-performance RL framework designed to overcome the systemic
bottlenecks that impede trillion-parameter model training. Ring-1T delivers
breakthrough results across critical benchmarks: 93.4 on AIME-2025, 86.72 on
HMMT-2025, 2088 on CodeForces, and 55.94 on ARC-AGI-v1. Notably, it attains a
silver medal-level result on the IMO-2025, underscoring its exceptional
reasoning capabilities. By releasing the complete 1T parameter MoE model to the
community, we provide the research community with direct access to cutting-edge
reasoning capabilities. This contribution marks a significant milestone in
democratizing large-scale reasoning intelligence and establishes a new baseline
for open-source model performance.