

Ring-lite: Scalable Reasoning via C3PO-Stabilized Reinforcement Learning for LLMs

June 17, 2025
作者: Ring Team, Bin Hu, Cai Chen, Deng Zhao, Ding Liu, Dingnan Jin, Feng Zhu, Hao Dai, Hongzhi Luan, Jia Guo, Jiaming Liu, Jiewei Wu, Jun Mei, Jun Zhou, Junbo Zhao, Junwu Xiong, Kaihong Zhang, Kuan Xu, Lei Liang, Liang Jiang, Liangcheng Fu, Longfei Zheng, Qiang Gao, Qing Cui, Quan Wan, Shaomian Zheng, Shuaicheng Li, Tongkai Yang, Wang Ren, Xiaodong Yan, Xiaopei Wan, Xiaoyun Feng, Xin Zhao, Xinxing Yang, Xinyu Kong, Xuemin Yang, Yang Li, Yingting Wu, Yongkang Liu, Zhankai Xu, Zhenduo Zhang, Zhenglei Zhou, Zhenyu Huang, Zhiqiang Zhang, Zihao Wang, Zujie Wen
cs.AI

Abstract

We present Ring-lite, a Mixture-of-Experts (MoE)-based large language model optimized via reinforcement learning (RL) to achieve efficient and robust reasoning capabilities. Built upon the publicly available Ling-lite model, a 16.8 billion parameter model with 2.75 billion activated parameters, our approach matches the performance of state-of-the-art (SOTA) small-scale reasoning models on challenging benchmarks (e.g., AIME, LiveCodeBench, GPQA-Diamond) while activating only one-third of the parameters required by comparable models. To accomplish this, we introduce a joint training pipeline that integrates distillation with RL, revealing previously undocumented challenges in MoE RL training. First, we identify optimization instability during RL training and propose Constrained Contextual Computation Policy Optimization (C3PO), a novel approach that enhances training stability and improves computational throughput through algorithm-system co-design. Second, we empirically demonstrate that selecting distillation checkpoints based on entropy loss, rather than validation metrics, yields superior performance-efficiency trade-offs in subsequent RL training. Finally, we develop a two-stage training paradigm to harmonize multi-domain data integration, addressing the domain conflicts that arise when training on mixed datasets. We will release the model, dataset, and code.
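To make the entropy-based checkpoint selection concrete, the sketch below scores each distillation checkpoint by its mean next-token entropy on a small probe set and keeps the lowest-entropy one for subsequent RL training. This is a minimal illustration, not the authors' implementation: the Hugging Face-style model/tokenizer interface, the probe-set construction, the `mean_token_entropy` and `select_checkpoint` helpers, and the "pick the minimum entropy" rule are all assumptions layered on the abstract's one-sentence description.

```python
# Illustrative sketch only. Assumptions (not from the paper): Hugging Face-style
# causal LMs, a small probe prompt set, and "lowest mean next-token entropy"
# as the selection rule; the paper only states that entropy loss, rather than
# validation metrics, drives checkpoint selection.
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer


@torch.no_grad()
def mean_token_entropy(model, tokenizer, prompts, device="cuda"):
    """Average next-token entropy of `model` over a probe prompt set."""
    model.eval()
    per_prompt = []
    for prompt in prompts:
        ids = tokenizer(prompt, return_tensors="pt").input_ids.to(device)
        logits = model(input_ids=ids).logits          # [1, seq_len, vocab]
        logp = F.log_softmax(logits, dim=-1)
        entropy = -(logp.exp() * logp).sum(dim=-1)    # [1, seq_len]
        per_prompt.append(entropy.mean().item())
    return sum(per_prompt) / len(per_prompt)


def select_checkpoint(checkpoint_paths, prompts, device="cuda"):
    """Return the distillation checkpoint with the lowest mean token entropy
    on the probe set, instead of the best validation score (assumed rule)."""
    best_path, best_entropy = None, float("inf")
    for path in checkpoint_paths:
        tokenizer = AutoTokenizer.from_pretrained(path)
        model = AutoModelForCausalLM.from_pretrained(path).to(device)
        ent = mean_token_entropy(model, tokenizer, prompts, device)
        if ent < best_entropy:
            best_path, best_entropy = path, ent
    return best_path
```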