Co-Evolving LLM Coder and Unit Tester via Reinforcement Learning
June 3, 2025
作者: Yinjie Wang, Ling Yang, Ye Tian, Ke Shen, Mengdi Wang
cs.AI
Abstract
We propose CURE, a novel reinforcement learning framework with a dedicated
reward design that co-evolves coding and unit test generation capabilities
based on their interaction outcomes, without any ground-truth code as
supervision. This approach enables flexible and scalable training and allows
the unit tester to learn directly from the coder's mistakes. Our derived
ReasonFlux-Coder-7B and 14B models improve code generation accuracy by 5.3% and
Best-of-N accuracy by 9.0% after optimization on Qwen2.5-Instruct models,
outperforming similarly sized Qwen-Coder, DeepSeek-Coder, and Seed-Coder. They
naturally extend to downstream tasks such as test-time scaling and agentic
coding, achieving an 8.1% improvement over the base model. For the long-CoT
model, our ReasonFlux-Coder-4B consistently outperforms Qwen3-4B while
achieving 64.8% inference efficiency in unit test generation. Notably, we also
find that our model can serve as an effective reward model for reinforcement
learning on base models. Project: https://github.com/Gen-Verse/CURE