

Co-Evolving LLM Coder and Unit Tester via Reinforcement Learning

June 3, 2025
Authors: Yinjie Wang, Ling Yang, Ye Tian, Ke Shen, Mengdi Wang
cs.AI

Abstract

We propose CURE, a novel reinforcement learning framework with a dedicated reward design that co-evolves coding and unit test generation capabilities based on their interaction outcomes, without any ground-truth code as supervision. This approach enables flexible and scalable training and allows the unit tester to learn directly from the coder's mistakes. Our derived ReasonFlux-Coder-7B and 14B models improve code generation accuracy by 5.3% and Best-of-N accuracy by 9.0% after optimization on Qwen2.5-Instruct models, outperforming similarly sized Qwen-Coder, DeepSeek-Coder, and Seed-Coder. They naturally extend to downstream tasks such as test-time scaling and agentic coding, achieving an 8.1% improvement over the base model. For long-CoT models, our ReasonFlux-Coder-4B consistently outperforms Qwen3-4B while achieving 64.8% inference efficiency in unit test generation. Notably, we also find that our model can serve as an effective reward model for reinforcement learning on base models. Project: https://github.com/Gen-Verse/CURE
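The paper derives its dedicated reward formally; as a rough, hypothetical illustration of the interaction signal such a scheme can build on, the Python sketch below scores sampled programs and generated unit tests purely from their mutual execution outcomes, with no ground-truth code involved, and reuses the same pass matrix for test-time Best-of-N selection. All names (`pass_matrix`, `coder_rewards`, `tester_rewards`, `best_of_n`) and the concrete reward formulas are illustrative assumptions, not CURE's actual objective.

```python
from typing import Callable, List

Program = Callable[[int], int]
Test = Callable[[Program], bool]

def pass_matrix(programs: List[Program], tests: List[Test]) -> List[List[bool]]:
    """M[i][j] is True iff program i passes test j; crashes count as failures."""
    matrix = []
    for prog in programs:
        row = []
        for test in tests:
            try:
                row.append(bool(test(prog)))
            except Exception:
                row.append(False)
        matrix.append(row)
    return matrix

def coder_rewards(matrix: List[List[bool]]) -> List[float]:
    """Reward each sampled program by the fraction of generated tests it passes."""
    n_tests = len(matrix[0])
    return [sum(row) / n_tests for row in matrix]

def tester_rewards(matrix: List[List[bool]]) -> List[float]:
    """Illustrative tester reward: favor tests that discriminate between
    candidates; a test every program passes (or fails) carries no signal."""
    n_progs = len(matrix)
    rewards = []
    for j in range(len(matrix[0])):
        frac_pass = sum(matrix[i][j] for i in range(n_progs)) / n_progs
        rewards.append(1.0 - abs(2.0 * frac_pass - 1.0))  # peaks at a 50/50 split
    return rewards

def best_of_n(programs: List[Program], matrix: List[List[bool]]) -> Program:
    """Test-time Best-of-N: keep the candidate that passes the most tests."""
    scores = [sum(row) for row in matrix]
    return programs[scores.index(max(scores))]

# Toy task: implement "double". One candidate is correct, one is buggy.
programs = [lambda x: 2 * x, lambda x: x + 2]
tests = [lambda f: f(3) == 6, lambda f: f(0) == 0]
M = pass_matrix(programs, tests)
print(coder_rewards(M))           # [1.0, 0.0] -> the correct program is rewarded
print(tester_rewards(M))          # [1.0, 1.0] -> both tests separate the candidates
print(best_of_n(programs, M)(5))  # -> 10
```

Because both reward signals come from executing generated tests against generated programs, the same loop can also score rollouts when the tuned model is used as a reward model for reinforcement learning on a base model, as the abstract notes.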
