Ring-lite: Scalable Reasoning via C3PO-Stabilized Reinforcement Learning for LLMs
June 17, 2025
作者: Ring Team, Bin Hu, Cai Chen, Deng Zhao, Ding Liu, Dingnan Jin, Feng Zhu, Hao Dai, Hongzhi Luan, Jia Guo, Jiaming Liu, Jiewei Wu, Jun Mei, Jun Zhou, Junbo Zhao, Junwu Xiong, Kaihong Zhang, Kuan Xu, Lei Liang, Liang Jiang, Liangcheng Fu, Longfei Zheng, Qiang Gao, Qing Cui, Quan Wan, Shaomian Zheng, Shuaicheng Li, Tongkai Yang, Wang Ren, Xiaodong Yan, Xiaopei Wan, Xiaoyun Feng, Xin Zhao, Xinxing Yang, Xinyu Kong, Xuemin Yang, Yang Li, Yingting Wu, Yongkang Liu, Zhankai Xu, Zhenduo Zhang, Zhenglei Zhou, Zhenyu Huang, Zhiqiang Zhang, Zihao Wang, Zujie Wen
cs.AI
Abstract
We present Ring-lite, a Mixture-of-Experts (MoE)-based large language model
optimized via reinforcement learning (RL) to achieve efficient and robust
reasoning capabilities. Built upon the publicly available Ling-lite model, a
16.8 billion parameter model with 2.75 billion activated parameters, our
approach matches the performance of state-of-the-art (SOTA) small-scale
reasoning models on challenging benchmarks (e.g., AIME, LiveCodeBench,
GPQA-Diamond) while activating only one-third of the parameters required by
comparable models. To accomplish this, we introduce a joint training pipeline
integrating distillation with RL, revealing undocumented challenges in MoE RL
training. First, we identify optimization instability during RL training, and
we propose Constrained Contextual Computation Policy Optimization (C3PO), a
novel approach that enhances training stability and improves computational
throughput via an algorithm-system co-design methodology. Second, we empirically
demonstrate that selecting distillation checkpoints based on entropy loss for
RL training, rather than validation metrics, yields superior
performance-efficiency trade-offs in subsequent RL training. Finally, we
develop a two-stage training paradigm to harmonize multi-domain data
integration, addressing domain conflicts that arise in training with mixed
datasets. We will release the model, dataset, and code.
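
The abstract describes C3PO only at a high level, as an algorithm-system co-design that stabilizes RL optimization and improves throughput. The following is a minimal sketch of one plausible reading, assuming the constraint is a fixed per-step token budget applied when packing rollouts into an optimization batch; the budget mechanism, the `TOKEN_BUDGET` value, and the `build_update_batch` helper are illustrative assumptions, not the paper's specification.

```python
# Illustrative sketch only: the abstract does not specify C3PO's mechanics.
# Assumption: stability and throughput come from capping the total number of
# tokens that contribute to each policy-gradient update, so per-step cost and
# gradient scale do not fluctuate with rollout length.

from dataclasses import dataclass
from typing import List

TOKEN_BUDGET = 16_384  # hypothetical fixed token budget per optimizer step


@dataclass
class Rollout:
    prompt_tokens: List[int]
    response_tokens: List[int]
    reward: float


def build_update_batch(rollouts: List[Rollout]) -> List[Rollout]:
    """Greedily pack rollouts until the fixed token budget is reached.

    Rollouts that would exceed the budget are deferred to the next step,
    so every optimizer step sees a near-constant amount of computation.
    """
    batch, used = [], 0
    for r in rollouts:
        n = len(r.prompt_tokens) + len(r.response_tokens)
        if used + n > TOKEN_BUDGET and batch:
            break  # defer the remaining rollouts to the next update
        batch.append(r)
        used += n
    return batch
```

The design intent illustrated here is simply that a constant token budget keeps both the systems side (step latency) and the optimization side (gradient magnitude) predictable, which is one way to read the claimed stability and throughput gains.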
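For checkpoint selection, the abstract states only that entropy loss, rather than validation metrics, is the selection signal. The sketch below assumes per-checkpoint mean token entropy has already been measured and picks the checkpoint closest to a target entropy; the target-entropy criterion and all numbers are hypothetical, not the paper's rule.

```python
# Illustrative sketch: select a distillation checkpoint by its entropy loss
# instead of a validation metric. The target-entropy criterion below is an
# assumption; the abstract only states that entropy loss is the signal.

from typing import Dict


def select_checkpoint_by_entropy(
    entropy_by_ckpt: Dict[str, float],  # e.g. {"step_2000": 0.92, ...}
    target_entropy: float = 0.8,        # hypothetical target value
) -> str:
    """Return the checkpoint whose mean token entropy is closest to the target.

    Intuition: a checkpoint that is neither over-confident (entropy near 0)
    nor under-trained (high entropy) leaves room for RL to improve the policy.
    """
    return min(
        entropy_by_ckpt,
        key=lambda ckpt: abs(entropy_by_ckpt[ckpt] - target_entropy),
    )


# Usage (hypothetical numbers):
# select_checkpoint_by_entropy({"step_1k": 1.35, "step_2k": 0.83, "step_3k": 0.41})
# -> "step_2k"
```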
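The two-stage paradigm for multi-domain data amounts to a staged sampling schedule. Below is a minimal sketch under the assumption that one domain is trained first and the remaining domains are mixed in afterwards; the domains, ordering, and weights are placeholders, since the abstract does not disclose the actual recipe.

```python
# Illustrative two-stage schedule for multi-domain RL data. The domains,
# ordering, and mixing weights below are placeholders, not the paper's recipe.

STAGES = [
    {"name": "stage_1", "domain_weights": {"math": 1.0}},
    {"name": "stage_2", "domain_weights": {"math": 0.5, "code": 0.3, "science": 0.2}},
]


def domain_weights_for_stage(stage_index: int) -> dict:
    """Return the sampling weights used to draw training prompts in a stage."""
    return STAGES[stage_index]["domain_weights"]
```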