Breaking Training Bottlenecks: Effective and Stable Reinforcement Learning for Coding Models
March 8, 2026
Authors: Zongqian Li, Shaohan Huang, Zewen Chi, Yixuan Su, Lexin Zhou, Li Dong, Nigel Collier, Furu Wei
cs.AI
Abstract
Modern code generation models exhibit longer outputs, accelerated capability growth, and shifting training dynamics, rendering traditional training methodologies, algorithms, and datasets ineffective at improving their performance. To address these training bottlenecks, we propose MicroCoder-GRPO, an improved Group Relative Policy Optimization approach with three innovations: conditional truncation masking, which preserves the potential of long outputs while maintaining training stability; diversity-determined temperature selection, which maintains and encourages output diversity; and removal of the KL loss combined with high clipping ratios, which facilitates solution diversity. MicroCoder-GRPO achieves up to a 17.6% relative improvement over strong baselines on LiveCodeBench v6, with more pronounced gains under extended-context evaluation. We also release MicroCoder-Dataset, a more challenging training corpus that yields 3x the performance gains of mainstream datasets on LiveCodeBench v6 within 300 training steps, and MicroCoder-Evaluator, a robust evaluation framework with approximately 25% higher evaluation accuracy and around 40% faster execution. Through systematic analysis of more than thirty controlled experiments, we distill 34 training insights spanning seven main aspects, demonstrating that properly trained models can match the performance of larger counterparts.
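The three algorithmic changes named in the abstract can be pictured as a single clipped policy-gradient objective over groups of sampled outputs. The sketch below is an illustrative assumption of how they might fit together, not the paper's actual implementation: the function name `grpo_loss`, the rule for when a truncated output is kept in the loss, and the `clip_ratio` value are all hypothetical.

```python
import torch

def grpo_loss(logp_new, logp_old, rewards, mask, truncated, clip_ratio=0.3):
    """Illustrative GRPO-style objective (assumed form, not from the paper).

    logp_new, logp_old: (G, T) per-token log-probs for G sampled outputs
    rewards:            (G,)  scalar reward per output
    mask:               (G, T) 1.0 for valid tokens, 0.0 for padding
    truncated:          (G,)  True if the output hit the length limit
    """
    # Group-relative advantages: normalize each reward against its group.
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)  # (G,)

    # Conditional truncation masking (assumed rule): drop a truncated output
    # from the loss only when its reward is non-positive, so promising long
    # outputs still contribute gradient instead of being penalized for length.
    keep = (~truncated) | (rewards > 0)                        # (G,)
    mask = mask * keep.unsqueeze(1).float()

    # PPO-style clipped surrogate with a high clipping ratio and no KL term,
    # allowing larger policy updates and a more diverse solution space.
    ratio = torch.exp(logp_new - logp_old)                     # (G, T)
    unclipped = ratio * adv.unsqueeze(1)
    clipped = torch.clamp(ratio, 1 - clip_ratio, 1 + clip_ratio) * adv.unsqueeze(1)
    per_token = torch.min(unclipped, clipped) * mask
    return -per_token.sum() / mask.sum().clamp(min=1.0)
```

Note there is no `beta * KL(pi || pi_ref)` term: per the abstract, the KL penalty is removed entirely rather than tuned down, and the wide clipping interval takes over the job of bounding each update.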