Ring-lite: Scalable Reasoning via C3PO-Stabilized Reinforcement Learning for LLMs
June 17, 2025
作者: Ring Team, Bin Hu, Cai Chen, Deng Zhao, Ding Liu, Dingnan Jin, Feng Zhu, Hao Dai, Hongzhi Luan, Jia Guo, Jiaming Liu, Jiewei Wu, Jun Mei, Jun Zhou, Junbo Zhao, Junwu Xiong, Kaihong Zhang, Kuan Xu, Lei Liang, Liang Jiang, Liangcheng Fu, Longfei Zheng, Qiang Gao, Qing Cui, Quan Wan, Shaomian Zheng, Shuaicheng Li, Tongkai Yang, Wang Ren, Xiaodong Yan, Xiaopei Wan, Xiaoyun Feng, Xin Zhao, Xinxing Yang, Xinyu Kong, Xuemin Yang, Yang Li, Yingting Wu, Yongkang Liu, Zhankai Xu, Zhenduo Zhang, Zhenglei Zhou, Zhenyu Huang, Zhiqiang Zhang, Zihao Wang, Zujie Wen
cs.AI
Abstract
We present Ring-lite, a Mixture-of-Experts (MoE)-based large language model
optimized via reinforcement learning (RL) to achieve efficient and robust
reasoning capabilities. Built upon the publicly available Ling-lite model, a
16.8 billion parameter model with 2.75 billion activated parameters, our
approach matches the performance of state-of-the-art (SOTA) small-scale
reasoning models on challenging benchmarks (e.g., AIME, LiveCodeBench,
GPQA-Diamond) while activating only one-third of the parameters required by
comparable models. To accomplish this, we introduce a joint training pipeline
integrating distillation with RL, revealing undocumented challenges in MoE RL
training. First, we identify optimization instability during RL training, and
we propose Constrained Contextual Computation Policy Optimization (C3PO), a
novel approach that enhances training stability and improves computational
throughput via an algorithm-system co-design methodology. Second, we empirically
demonstrate that selecting distillation checkpoints based on entropy loss for
RL training, rather than validation metrics, yields superior
performance-efficiency trade-offs in subsequent RL training. Finally, we
develop a two-stage training paradigm to harmonize multi-domain data
integration, addressing domain conflicts that arise in training with mixed
datasets. We will release the model, dataset, and code.
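
The abstract describes C3PO only at a high level, as an algorithm-system co-design that stabilizes RL optimization and improves throughput. The following is a minimal sketch of one plausible reading, assuming the constraint is a fixed per-step token budget applied when packing rollouts into an optimization batch; the budget mechanism, the `TOKEN_BUDGET` value, and the `build_update_batch` helper are illustrative assumptions, not the paper's specification.

```python
# Illustrative sketch only: the abstract does not specify C3PO's mechanics.
# Assumption: stability and throughput come from capping the total number of
# tokens that contribute to each policy-gradient update, so per-step cost and
# gradient scale do not fluctuate with rollout length.

from dataclasses import dataclass
from typing import List

TOKEN_BUDGET = 16_384  # hypothetical fixed token budget per optimizer step


@dataclass
class Rollout:
    prompt_tokens: List[int]
    response_tokens: List[int]
    reward: float


def build_update_batch(rollouts: List[Rollout]) -> List[Rollout]:
    """Greedily pack rollouts until the fixed token budget is reached.

    Rollouts that would exceed the budget are deferred to the next step,
    so every optimizer step sees a near-constant amount of computation.
    """
    batch, used = [], 0
    for r in rollouts:
        n = len(r.prompt_tokens) + len(r.response_tokens)
        if used + n > TOKEN_BUDGET and batch:
            break  # defer the remaining rollouts to the next update
        batch.append(r)
        used += n
    return batch
```

The design intent illustrated here is simply that a constant token budget keeps both the systems side (step latency) and the optimization side (gradient magnitude) predictable, which is one way to read the claimed stability and throughput gains.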
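For checkpoint selection, the abstract states only that entropy loss, rather than validation metrics, is the selection signal. The sketch below assumes per-checkpoint mean token entropy has already been measured and picks the checkpoint closest to a target entropy; the target-entropy criterion and all numbers are hypothetical, not the paper's rule.

```python
# Illustrative sketch: select a distillation checkpoint by its entropy loss
# instead of a validation metric. The target-entropy criterion below is an
# assumption; the abstract only states that entropy loss is the signal.

from typing import Dict


def select_checkpoint_by_entropy(
    entropy_by_ckpt: Dict[str, float],  # e.g. {"step_2000": 0.92, ...}
    target_entropy: float = 0.8,        # hypothetical target value
) -> str:
    """Return the checkpoint whose mean token entropy is closest to the target.

    Intuition: a checkpoint that is neither over-confident (entropy near 0)
    nor under-trained (high entropy) leaves room for RL to improve the policy.
    """
    return min(
        entropy_by_ckpt,
        key=lambda ckpt: abs(entropy_by_ckpt[ckpt] - target_entropy),
    )


# Usage (hypothetical numbers):
# select_checkpoint_by_entropy({"step_1k": 1.35, "step_2k": 0.83, "step_3k": 0.41})
# -> "step_2k"
```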
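The two-stage paradigm for multi-domain data amounts to a staged sampling schedule. Below is a minimal sketch under the assumption that one domain is trained first and the remaining domains are mixed in afterwards; the domains, ordering, and weights are placeholders, since the abstract does not disclose the actual recipe.

```python
# Illustrative two-stage schedule for multi-domain RL data. The domains,
# ordering, and mixing weights below are placeholders, not the paper's recipe.

STAGES = [
    {"name": "stage_1", "domain_weights": {"math": 1.0}},
    {"name": "stage_2", "domain_weights": {"math": 0.5, "code": 0.3, "science": 0.2}},
]


def domain_weights_for_stage(stage_index: int) -> dict:
    """Return the sampling weights used to draw training prompts in a stage."""
    return STAGES[stage_index]["domain_weights"]
```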