Ring-lite: Scalable Reasoning via C3PO-Stabilized Reinforcement Learning for LLMs
June 17, 2025
作者: Ring Team, Bin Hu, Cai Chen, Deng Zhao, Ding Liu, Dingnan Jin, Feng Zhu, Hao Dai, Hongzhi Luan, Jia Guo, Jiaming Liu, Jiewei Wu, Jun Mei, Jun Zhou, Junbo Zhao, Junwu Xiong, Kaihong Zhang, Kuan Xu, Lei Liang, Liang Jiang, Liangcheng Fu, Longfei Zheng, Qiang Gao, Qing Cui, Quan Wan, Shaomian Zheng, Shuaicheng Li, Tongkai Yang, Wang Ren, Xiaodong Yan, Xiaopei Wan, Xiaoyun Feng, Xin Zhao, Xinxing Yang, Xinyu Kong, Xuemin Yang, Yang Li, Yingting Wu, Yongkang Liu, Zhankai Xu, Zhenduo Zhang, Zhenglei Zhou, Zhenyu Huang, Zhiqiang Zhang, Zihao Wang, Zujie Wen
cs.AI
Abstract
We present Ring-lite, a Mixture-of-Experts (MoE)-based large language model
optimized via reinforcement learning (RL) to achieve efficient and robust
reasoning capabilities. Built upon the publicly available Ling-lite model, a
16.8 billion parameter model with 2.75 billion activated parameters, our
approach matches the performance of state-of-the-art (SOTA) small-scale
reasoning models on challenging benchmarks (e.g., AIME, LiveCodeBench,
GPQA-Diamond) while activating only one-third of the parameters required by
comparable models. To accomplish this, we introduce a joint training pipeline
integrating distillation with RL, revealing undocumented challenges in MoE RL
training. First, we identify optimization instability during RL training, and
we propose Constrained Contextual Computation Policy Optimization (C3PO), a
novel approach that enhances training stability and improves computational
throughput via algorithm-system co-design methodology. Second, we empirically
demonstrate that selecting distillation checkpoints for RL based on entropy
loss, rather than on validation metrics, yields superior
performance-efficiency trade-offs in subsequent RL training. Finally, we
develop a two-stage training paradigm to harmonize multi-domain data
integration, addressing domain conflicts that arise when training on mixed
datasets. We will release the model, dataset, and code.
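The abstract does not spell out how entropy-loss-based checkpoint selection is carried out. The sketch below is a minimal, hypothetical illustration of the general idea only, assuming it amounts to scoring each distillation checkpoint by its mean token-level entropy on a small probe set and choosing a checkpoint by that score instead of by validation accuracy; the checkpoint paths, probe texts, and the "pick the lowest-entropy checkpoint" rule are illustrative assumptions, not the authors' procedure.

```python
# Hypothetical sketch: rank distillation checkpoints by mean token-level entropy
# on a small probe set, instead of by a validation metric. The selection rule
# (lowest mean entropy) is an illustrative assumption, not the paper's method.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer


@torch.no_grad()
def mean_token_entropy(model, tokenizer, probe_texts, device="cuda"):
    """Average per-token entropy of the model's next-token distributions."""
    total_entropy, total_tokens = 0.0, 0
    for text in probe_texts:
        inputs = tokenizer(text, return_tensors="pt").to(device)
        logits = model(**inputs).logits                      # [1, seq_len, vocab]
        log_probs = torch.log_softmax(logits, dim=-1)
        entropy = -(log_probs.exp() * log_probs).sum(dim=-1)  # [1, seq_len]
        total_entropy += entropy.sum().item()
        total_tokens += entropy.numel()
    return total_entropy / total_tokens


def select_checkpoint(checkpoint_dirs, probe_texts, device="cuda"):
    """Score every checkpoint and return the one with the lowest mean entropy."""
    scores = {}
    for ckpt in checkpoint_dirs:
        tokenizer = AutoTokenizer.from_pretrained(ckpt)
        model = AutoModelForCausalLM.from_pretrained(
            ckpt, torch_dtype=torch.bfloat16
        ).to(device)
        scores[ckpt] = mean_token_entropy(model, tokenizer, probe_texts, device)
        del model
        torch.cuda.empty_cache()
    return min(scores, key=scores.get), scores
```

In this reading, entropy acts as a cheap proxy for how "decided" the distilled policy already is, which the abstract argues predicts downstream RL efficiency better than benchmark validation scores; the exact criterion used by the authors may differ.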