Ring-lite: Scalable Reasoning via C3PO-Stabilized Reinforcement Learning for LLMs
June 17, 2025
作者: Ring Team, Bin Hu, Cai Chen, Deng Zhao, Ding Liu, Dingnan Jin, Feng Zhu, Hao Dai, Hongzhi Luan, Jia Guo, Jiaming Liu, Jiewei Wu, Jun Mei, Jun Zhou, Junbo Zhao, Junwu Xiong, Kaihong Zhang, Kuan Xu, Lei Liang, Liang Jiang, Liangcheng Fu, Longfei Zheng, Qiang Gao, Qing Cui, Quan Wan, Shaomian Zheng, Shuaicheng Li, Tongkai Yang, Wang Ren, Xiaodong Yan, Xiaopei Wan, Xiaoyun Feng, Xin Zhao, Xinxing Yang, Xinyu Kong, Xuemin Yang, Yang Li, Yingting Wu, Yongkang Liu, Zhankai Xu, Zhenduo Zhang, Zhenglei Zhou, Zhenyu Huang, Zhiqiang Zhang, Zihao Wang, Zujie Wen
cs.AI
Abstract
We present Ring-lite, a Mixture-of-Experts (MoE)-based large language model
optimized via reinforcement learning (RL) to achieve efficient and robust
reasoning capabilities. Built upon the publicly available Ling-lite model, a
16.8 billion parameter model with 2.75 billion activated parameters, our
approach matches the performance of state-of-the-art (SOTA) small-scale
reasoning models on challenging benchmarks (e.g., AIME, LiveCodeBench,
GPQA-Diamond) while activating only one-third of the parameters required by
comparable models. To accomplish this, we introduce a joint training pipeline
integrating distillation with RL, revealing undocumented challenges in MoE RL
training. First, we identify optimization instability during RL training, and
we propose Constrained Contextual Computation Policy Optimization (C3PO), a
novel approach that enhances training stability and improves computational
throughput via algorithm-system co-design methodology. Second, we empirically
demonstrate that selecting distillation checkpoints for RL based on entropy
loss, rather than on validation metrics, yields superior
performance-efficiency trade-offs in subsequent RL training. Finally, we
develop a two-stage training paradigm to harmonize multi-domain data
integration, addressing domain conflicts that arise when training on mixed
datasets. We will release the model, dataset, and code.
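The abstract does not spell out how entropy-loss-based checkpoint selection is carried out. The sketch below is a minimal, hypothetical illustration of the general idea only, assuming it amounts to scoring each distillation checkpoint by its mean token-level entropy on a small probe set and choosing a checkpoint by that score instead of by validation accuracy; the checkpoint paths, probe texts, and the "pick the lowest-entropy checkpoint" rule are illustrative assumptions, not the authors' procedure.

```python
# Hypothetical sketch: rank distillation checkpoints by mean token-level entropy
# on a small probe set, instead of by a validation metric. The selection rule
# (lowest mean entropy) is an illustrative assumption, not the paper's method.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer


@torch.no_grad()
def mean_token_entropy(model, tokenizer, probe_texts, device="cuda"):
    """Average per-token entropy of the model's next-token distributions."""
    total_entropy, total_tokens = 0.0, 0
    for text in probe_texts:
        inputs = tokenizer(text, return_tensors="pt").to(device)
        logits = model(**inputs).logits                      # [1, seq_len, vocab]
        log_probs = torch.log_softmax(logits, dim=-1)
        entropy = -(log_probs.exp() * log_probs).sum(dim=-1)  # [1, seq_len]
        total_entropy += entropy.sum().item()
        total_tokens += entropy.numel()
    return total_entropy / total_tokens


def select_checkpoint(checkpoint_dirs, probe_texts, device="cuda"):
    """Score every checkpoint and return the one with the lowest mean entropy."""
    scores = {}
    for ckpt in checkpoint_dirs:
        tokenizer = AutoTokenizer.from_pretrained(ckpt)
        model = AutoModelForCausalLM.from_pretrained(
            ckpt, torch_dtype=torch.bfloat16
        ).to(device)
        scores[ckpt] = mean_token_entropy(model, tokenizer, probe_texts, device)
        del model
        torch.cuda.empty_cache()
    return min(scores, key=scores.get), scores
```

In this reading, entropy acts as a cheap proxy for how "decided" the distilled policy already is, which the abstract argues predicts downstream RL efficiency better than benchmark validation scores; the exact criterion used by the authors may differ.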