

DRIVE: Data Curation Best Practices for Reinforcement Learning with Verifiable Reward in Competitive Code Generation

November 9, 2025
Authors: Speed Zhu, Jianwei Cai, Guang Chen, Lulu Wu, Saiyong Yang, Wiggin Zhou
cs.AI

Abstract

Recent reasoning-first models (e.g., OpenAI o1, DeepSeek R1) have spurred a resurgence of interest in RLVR (reinforcement learning with verifiable rewards). Nevertheless, advances are dominated by mathematics (e.g., AIME), with competitive-programming code generation underexplored and data curation receiving less attention than RL algorithm design. We investigate how to construct RLVR datasets (i.e., RL prompts) and present practical training techniques that yield strong performance on competitive-programming code generation. Our pipeline begins with supervised fine-tuning (SFT) distilled from strong open-source models, augmented with general-purpose and reasoning-intensive data. RL then follows a two-stage process with executable, testcase-driven rewards. First, we train on a large, uniformly distributed set of competitive-programming problems using Group Relative Policy Optimization (GRPO) with 8 rollouts per prompt and a relatively short response-generation window (e.g., 32k during SFT and 24k in this stage) to expand entropy and mitigate repetition and truncation. Second, we perform Pre-GRPO: updating on a small, high-quality set of challenging problems with a large rollout budget (64 rollouts per prompt) under a hard-focus curriculum that continuously retains the most difficult instances throughout training. We implement our method on Qwen2.5-32B and evaluate on LeetCode and Codeforces weekly contests to avoid data leakage. The resulting model achieves state-of-the-art performance among models of similar scale and is comparable to leading systems such as DeepSeek v3.1 and Doubao-1.5-Thinking. We also examine scaling trends and observe strong RL scaling on an internal large-scale MoE model. Our study distills concise best practices for data curation, entropy expansion, and curriculum design in RLVR for competitive-programming code generation.
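
The following is a minimal, illustrative Python sketch (not the authors' implementation) of the three ingredients the abstract describes: an executable, testcase-driven reward, GRPO-style group-relative advantages over the rollouts sampled for one prompt, and a hard-focus curriculum step that retains only the problems the current policy still struggles with. The Problem record, the sandboxed run executor, and the 0.5 retention threshold are assumptions made for the example, not details taken from the paper.

from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Problem:
    # Hypothetical container for one competitive-programming prompt and its
    # executable testcases (stdin, expected stdout).
    prompt: str
    testcases: List[Tuple[str, str]]


def testcase_reward(program: str, problem: Problem,
                    run: Callable[[str, str], str]) -> float:
    # Verifiable reward: 1.0 only if the generated program passes every
    # testcase. `run(program, stdin)` stands in for a sandboxed executor.
    for stdin, expected in problem.testcases:
        try:
            if run(program, stdin).strip() != expected.strip():
                return 0.0
        except Exception:
            return 0.0
    return 1.0


def group_relative_advantages(rewards: List[float]) -> List[float]:
    # GRPO-style advantages: each rollout's reward is normalized by the mean
    # and standard deviation of the group sampled for the same prompt.
    n = len(rewards)
    mean = sum(rewards) / n
    std = (sum((r - mean) ** 2 for r in rewards) / n) ** 0.5
    return [(r - mean) / (std + 1e-6) for r in rewards]


def hard_focus_filter(problems: List[Problem],
                      pass_rates: List[float],
                      keep_below: float = 0.5) -> List[Problem]:
    # Hard-focus curriculum step: keep only the prompts the current policy
    # still finds difficult (low rollout pass rate). The 0.5 threshold is an
    # illustrative choice, not a value from the paper.
    return [p for p, rate in zip(problems, pass_rates) if rate < keep_below]

In this reading, the first stage would apply group_relative_advantages to 8 rollouts per prompt over the full, uniformly distributed problem set, while the Pre-GRPO stage would use 64 rollouts per prompt and re-apply hard_focus_filter between updates so that only the hardest instances remain in the training pool.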