Co-Evolving LLM Coder and Unit Tester via Reinforcement Learning
June 3, 2025
作者: Yinjie Wang, Ling Yang, Ye Tian, Ke Shen, Mengdi Wang
cs.AI
Abstract
We propose CURE, a novel reinforcement learning framework with a dedicated
reward design that co-evolves coding and unit test generation capabilities
based on their interaction outcomes, without any ground-truth code as
supervision. This approach enables flexible and scalable training and allows
the unit tester to learn directly from the coder's mistakes. Our derived
ReasonFlux-Coder-7B and 14B models improve code generation accuracy by 5.3% and
Best-of-N accuracy by 9.0% after optimization on Qwen2.5-Instruct models,
outperforming similarly sized Qwen-Coder, DeepSeek-Coder, and Seed-Coder. They
naturally extend to downstream tasks such as test-time scaling and agentic
coding, achieving an 8.1% improvement over the base model. For the long-CoT
model, our ReasonFlux-Coder-4B consistently outperforms Qwen3-4B while
achieving 64.8% inference efficiency in unit test generation. Notably, we also
find that our model can serve as an effective reward model for reinforcement
learning on base models. Project: https://github.com/Gen-Verse/CURE