Co-Evolving LLM Coder and Unit Tester via Reinforcement Learning

Abstract

We propose CURE, a novel reinforcement learning framework with a dedicated reward design that co-evolves coding and unit test generation capabilities based on their interaction outcomes, without any ground-truth code as supervision. This approach enables flexible and scalable training and allows the unit tester to learn directly from the coder's mistakes. Our derived ReasonFlux-Coder-7B and 14B models improve code generation accuracy by 5.3% and Best-of-N accuracy by 9.0% after optimization on Qwen2.5-Instruct models, outperforming similarly sized Qwen-Coder, DeepSeek-Coder, and Seed-Coder. They naturally extend to downstream tasks such as test-time scaling and agentic coding, achieving an 8.1% improvement over the base model. For the long-CoT model, our ReasonFlux-Coder-4B consistently outperforms Qwen3-4B while achieving 64.8% inference efficiency in unit test generation. Notably, we also find that our model can serve as an effective reward model for reinforcement learning on base models. Project: https://github.com/Gen-Verse/CURE
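To make the Best-of-N setting concrete, the sketch below shows one common way generated unit tests can be used to choose among N sampled programs: run every candidate against every test and keep the candidate that passes the most. The abstract does not state the selection rule, so this pass-count scheme is an assumption for illustration, not necessarily what CURE does. The names `pass_matrix` and `best_of_n`, the toy candidates, and the toy tests are all hypothetical.

```python
from typing import Callable, List, Sequence

# Hypothetical representation: a candidate is a plain Python function,
# and a unit test is a predicate that receives the candidate.
Candidate = Callable[[int], int]
UnitTest = Callable[[Candidate], bool]


def pass_matrix(candidates: Sequence[Candidate],
                tests: Sequence[UnitTest]) -> List[List[bool]]:
    """Return matrix[i][j] == True when candidate i passes test j."""
    matrix: List[List[bool]] = []
    for cand in candidates:
        row: List[bool] = []
        for test in tests:
            try:
                row.append(bool(test(cand)))
            except Exception:
                # A crashing candidate counts as failing that test.
                row.append(False)
        matrix.append(row)
    return matrix


def best_of_n(candidates: Sequence[Candidate],
              tests: Sequence[UnitTest]) -> int:
    """Index of the candidate passing the most tests (first one wins ties)."""
    scores = [sum(row) for row in pass_matrix(candidates, tests)]
    return max(range(len(candidates)), key=lambda i: scores[i])


# Toy task (assumed for illustration): "return the square of x".
candidates: List[Candidate] = [
    lambda x: x * x,       # correct
    lambda x: x + x,       # wrong, but agrees with x*x at x = 0 and x = 2
    lambda x: x ** 2 + 1,  # wrong everywhere
]
tests: List[UnitTest] = [
    lambda f: f(2) == 4,
    lambda f: f(3) == 9,
    lambda f: f(0) == 0,
]

chosen = best_of_n(candidates, tests)
```

The second candidate passes two of the three tests, which shows why test quality matters in this setting: a test that separates correct from incorrect code (here `f(3) == 9`) is what makes the selection reliable.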
