Every Step Evolves: Scaling Reinforcement Learning for Trillion-Scale Thinking Model

October 21, 2025
Authors: Ling Team, Anqi Shen, Baihui Li, Bin Hu, Bin Jing, Cai Chen, Chao Huang, Chao Zhang, Chaokun Yang, Cheng Lin, Chengyao Wen, Congqi Li, Deng Zhao, Dingbo Yuan, Donghai You, Fagui Mao, Fanzhuang Meng, Feng Xu, Guojie Li, Guowei Wang, Hao Dai, Haonan Zheng, Hong Liu, Jia Guo, Jiaming Liu, Jian Liu, Jianhao Fu, Jiannan Shi, Jianwen Wang, Jianxin Lai, Jin Yang, Jun Mei, Jun Zhou, Junbo Zhao, Junping Zhao, Kuan Xu, Le Su, Lei Chen, Li Tang, Liang Jiang, Liangcheng Fu, Lianhao Xu, Linfeng Shi, Lisha Liao, Longfei Zheng, Meng Li, Mingchun Chen, Qi Zuo, Qiang Cheng, Qianggang Cao, Qitao Shi, Quanrui Guo, Senlin Zhu, Shaofei Wang, Shaomian Zheng, Shuaicheng Li, Shuwei Gu, Siba Chen, Tao Wu, Tao Zhang, Tianyu Zhang, Tianyu Zhou, Tiwei Bie, Tongkai Yang, Wang Hong, Wang Ren, Weihua Chen, Wenbo Yu, Wengang Zheng, Xiangchun Wang, Xiaodong Yan, Xiaopei Wan, Xin Zhao, Xinyu Kong, Xinyu Tang, Xudong Han, Xudong Wang, Xuemin Yang, Xueyu Hu, Yalin Zhang, Yan Sun, Yicheng Shan, Yilong Wang, Yingying Xu, Yongkang Liu, Yongzhen Guo, Yuanyuan Wang, Yuchen Yan, Yuefan Wang, Yuhong Guo, Zehuan Li, Zhankai Xu, Zhe Li, Zhenduo Zhang, Zhengke Gui, Zhenxuan Pan, Zhenyu Huang, Zhenzhong Lan, Zhiqiang Ding, Zhiqiang Zhang, Zhixun Li, Zhizhen Liu, Zihao Wang, Zujie Wen
cs.AI

Abstract

We present Ring-1T, the first open-source, state-of-the-art thinking model at the trillion-parameter scale. It comprises 1 trillion total parameters and activates approximately 50 billion per token. Training such models at this scale introduces unprecedented challenges, including training-inference misalignment, inefficiencies in rollout processing, and bottlenecks in the RL system. To address these, we pioneer three interconnected innovations: (1) IcePop stabilizes RL training via token-level discrepancy masking and clipping, resolving instability from training-inference mismatches; (2) C3PO++ improves resource utilization for long rollouts under a token budget by dynamically partitioning them, thereby achieving high time efficiency; and (3) ASystem, a high-performance RL framework designed to overcome the systemic bottlenecks that impede trillion-parameter model training. Ring-1T delivers breakthrough results across critical benchmarks: 93.4 on AIME-2025, 86.72 on HMMT-2025, 2088 on CodeForces, and 55.94 on ARC-AGI-v1. Notably, it attains a silver-medal-level result on IMO-2025, underscoring its exceptional reasoning capabilities. By releasing the complete 1T-parameter MoE model, we give the research community direct access to cutting-edge reasoning capabilities. This contribution marks a significant milestone in democratizing large-scale reasoning intelligence and establishes a new baseline for open-source model performance.
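The abstract names IcePop's mechanism, token-level discrepancy masking and clipping, without spelling it out. Below is a minimal PyTorch sketch of the general idea: when the training engine and the inference engine assign different probabilities to the same sampled token, tokens whose probability ratio drifts outside a clipping band are masked out of the policy-gradient loss. The function names, ratio bounds, and loss form here are illustrative assumptions, not the paper's implementation.

```python
import torch

def icepop_mask(train_logprobs: torch.Tensor,
                infer_logprobs: torch.Tensor,
                low: float = 0.5,
                high: float = 2.0) -> torch.Tensor:
    """Token-level discrepancy mask between training and inference engines.

    Keeps only tokens whose train/inference probability ratio lies inside
    [low, high]; out-of-band tokens are excluded from the loss, clipping
    the effect of the training-inference mismatch. The bounds are
    illustrative placeholders, not values from the paper.
    """
    ratio = torch.exp(train_logprobs - infer_logprobs)  # p_train / p_infer per token
    return (ratio >= low) & (ratio <= high)             # True = token kept in the loss

def masked_pg_loss(train_logprobs: torch.Tensor,
                   infer_logprobs: torch.Tensor,
                   advantages: torch.Tensor) -> torch.Tensor:
    """Vanilla policy-gradient loss with mismatched tokens zeroed out."""
    mask = icepop_mask(train_logprobs, infer_logprobs).float()
    per_token = -advantages * train_logprobs            # standard REINFORCE-style term
    return (per_token * mask).sum() / mask.sum().clamp(min=1.0)
```

The design intuition, as described in the abstract, is that gradients computed from tokens where the two engines disagree inject noise into RL training; masking and clipping those tokens trades a small amount of signal for stability at trillion-parameter scale.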