DRIVE: Data Curation Best Practices for Reinforcement Learning with Verifiable Reward in Competitive Code Generation
November 9, 2025
Authors: Speed Zhu, Jianwei Cai, Guang Chen, Lulu Wu, Saiyong Yang, Wiggin Zhou
cs.AI
Abstract
Recent reasoning-first models (e.g., OpenAI o1, DeepSeek R1) have spurred a
resurgence of interest in reinforcement learning with verifiable rewards
(RLVR). Nevertheless, advances are dominated by
mathematics (e.g., AIME), with competitive-programming code generation
underexplored and data curation receiving less attention than RL algorithm
design. We investigate how to construct RLVR datasets (i.e., RL prompts) and
present practical training techniques that yield strong performance on
competitive-programming code generation. Our pipeline begins with supervised
fine-tuning (SFT) distilled from strong open-source models, augmented with
general-purpose and reasoning-intensive data. RL then follows a two-stage
process with executable, testcase-driven rewards: first, training on a large,
uniformly distributed set of competitive-programming problems using Group
Relative Policy Optimization (GRPO) with 8 rollouts per prompt and a relatively
short response-generation window (e.g., 32k during SFT and 24k in this stage)
to expand entropy and mitigate repetition and truncation; second, we perform
Pre-GRPO: updating on a small, high-quality set of challenging
problems with a large rollout budget (64 rollouts per prompt) under a
hard-focus curriculum that continuously retains the most difficult instances
throughout training. We implement our method on Qwen2.5-32B and evaluate on
LeetCode and Codeforces weekly contests to avoid data leakage. The resulting
model achieves state-of-the-art performance among models of similar scale and
is comparable to leading systems such as DeepSeek v3.1 and Doubao-1.5-Thinking.
We also examine scaling trends and observe strong RL scaling on an internal
large-scale MoE model. Our study distills concise best practices for data
curation, entropy expansion, and curriculum design in RLVR for
competitive-programming code generation.
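To make the second stage concrete, below is a minimal sketch (not the authors' released code) of the Pre-GRPO loop as the abstract describes it: a verifiable reward computed by executing the generated program against the problem's test cases, a large rollout budget per prompt (64), and a hard-focus curriculum that keeps retraining on the prompts the policy still fails most often. The `policy.generate` / `policy.grpo_update` interface, the sandboxed `run` executor, and the retention fraction are illustrative assumptions, not details from the paper.

```python
# Sketch of stage-2 "Pre-GRPO" with executable, testcase-driven rewards and a
# hard-focus curriculum. Interfaces marked "assumed" are placeholders.
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Problem:
    prompt: str
    tests: List[Tuple[str, str]]  # (stdin, expected_stdout) pairs


def testcase_reward(code: str, tests: List[Tuple[str, str]],
                    run: Callable[[str, str], str]) -> float:
    """Binary, verifiable reward: 1.0 only if the program passes every test.

    `run(code, stdin)` is an assumed sandboxed executor returning stdout.
    """
    for stdin, expected in tests:
        if run(code, stdin).strip() != expected.strip():
            return 0.0
    return 1.0


def pre_grpo_stage(policy, problems: List[Problem], run,
                   steps: int, rollouts: int = 64,
                   keep_hardest: float = 0.5):
    """Hard-focus curriculum over a small, challenging problem set."""
    pool = list(problems)
    for _ in range(steps):
        scored = []
        for prob in pool:
            # Large rollout budget per prompt (64 in the paper's stage 2).
            completions = [policy.generate(prob.prompt) for _ in range(rollouts)]
            rewards = [testcase_reward(c, prob.tests, run) for c in completions]
            # Assumed trainer hook: group-relative advantage update (GRPO).
            policy.grpo_update(prob.prompt, completions, rewards)
            scored.append((sum(rewards) / rollouts, prob))  # pass rate as difficulty proxy
        # Retain the lowest-pass-rate (hardest) prompts for the next pass.
        scored.sort(key=lambda item: item[0])
        pool = [p for _, p in scored[: max(1, int(len(scored) * keep_hardest))]]
    return policy
```

The pass-rate threshold and 50% retention fraction above are illustrative choices; the abstract only specifies that the curriculum continuously retains the most difficult instances throughout training.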