DRIVE: Data Curation Best Practices for Reinforcement Learning with Verifiable Reward in Competitive Code Generation

November 9, 2025
Authors: Speed Zhu, Jianwei Cai, Guang Chen, Lulu Wu, Saiyong Yang, Wiggin Zhou
cs.AI

Abstract

Recent reasoning-first models (e.g., OpenAI o1, DeepSeek R1) have spurred a resurgence of interest in reinforcement learning with verifiable rewards (RLVR). Nevertheless, advances are dominated by mathematics (e.g., AIME), while competitive-programming code generation remains underexplored and data curation receives less attention than RL algorithm design. We investigate how to construct RLVR datasets (i.e., RL prompts) and present practical training techniques that yield strong performance on competitive-programming code generation. Our pipeline begins with supervised fine-tuning (SFT) distilled from strong open-source models, augmented with general-purpose and reasoning-intensive data. RL then follows a two-stage process with executable, testcase-driven rewards: first, we train on a large, uniformly distributed set of competitive-programming problems using Group Relative Policy Optimization (GRPO) with 8 rollouts per prompt and a relatively short response-generation window (e.g., 32k during SFT and 24k in this stage) to expand entropy and mitigate repetition and truncation; second, we perform Pre-GRPO, updating on a small, high-quality set of challenging problems with a large rollout budget (64 rollouts per prompt) under a hard-focus curriculum that continuously retains the most difficult instances throughout training. We implement our method on Qwen2.5-32B and evaluate on LeetCode and Codeforces weekly contests to avoid data leakage. The resulting model achieves state-of-the-art performance among models of similar scale and is comparable to leading systems such as DeepSeek v3.1 and Doubao-1.5-Thinking. We also examine scaling trends and observe strong RL scaling on an internal large-scale MoE model. Our study distills concise best practices for data curation, entropy expansion, and curriculum design in RLVR for competitive-programming code generation.
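
The two-stage recipe in the abstract rests on two mechanics: GRPO's group-relative advantage normalization over the rollouts of each prompt, and a hard-focus curriculum that keeps re-training on the problems with the lowest pass rates. The sketch below is a minimal, self-contained illustration of those two ideas under toy assumptions (binary testcase rewards, hypothetical helper names such as `group_relative_advantages` and `hard_focus_retain`); it is not the paper's implementation.

```python
# Minimal sketch (not the authors' code) of GRPO-style group-relative advantages
# and a hard-focus curriculum for the Pre-GRPO stage, using toy binary rewards
# as stand-ins for executable, testcase-driven pass/fail signals.
import numpy as np

def group_relative_advantages(group_rewards: np.ndarray) -> np.ndarray:
    """Normalize each prompt's rollout rewards by the group mean/std, so the
    learning signal is relative to the other rollouts of the same prompt."""
    mean = group_rewards.mean(axis=-1, keepdims=True)
    std = group_rewards.std(axis=-1, keepdims=True)
    return (group_rewards - mean) / (std + 1e-6)

def hard_focus_retain(prompts, pass_rates, keep_fraction=0.25):
    """Hard-focus curriculum: retain only the hardest prompts (lowest pass rate)."""
    order = np.argsort(pass_rates)  # ascending: hardest prompts first
    k = max(1, int(len(prompts) * keep_fraction))
    return [prompts[i] for i in order[:k]]

if __name__ == "__main__":
    rng = np.random.default_rng(0)

    # Stage-1 stand-in: 6 prompts x 8 rollouts, binary testcase rewards.
    stage1_rewards = rng.binomial(1, p=0.4, size=(6, 8)).astype(float)
    adv = group_relative_advantages(stage1_rewards)
    print("stage-1 advantages shape:", adv.shape)

    # Stage-2 stand-in (Pre-GRPO): a small hard set with a large rollout budget
    # (64 per prompt), repeatedly retaining the most difficult instances.
    prompts = [f"problem_{i}" for i in range(6)]
    pass_rates = stage1_rewards.mean(axis=1)
    hard_set = hard_focus_retain(prompts, pass_rates, keep_fraction=0.5)
    print("retained hard set:", hard_set)
```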