CoSPlay: Self-Generated Code and Unit Tests for Cooperative Self-Play at Test Time

Abstract

Reinforcement Learning with Verifiable Rewards (RLVR) and Test-Time Scaling (TTS) have recently advanced LLM code generation through executable verification. Yet Ground-Truth Unit Tests (GT UTs) remain a bottleneck: state-of-the-art RLVR methods require them for costly training, while existing TTS methods lose competitiveness without them. This motivates GT-free TTS, where existing methods directly use self-generated UTs to refine and select code candidates. However, such UTs are often noisy or spuriously coupled with wrong code, and UT quality in turn cannot be validated without reliable code. The key challenge is therefore to improve code and UTs jointly. To this end, we present CoSPlay, a GT-free, training-free framework that jointly improves code candidates and UTs through cooperative self-play. It first explores diverse solution ideas and identifies their potential failure modes to produce discriminative UT ideas. It then uses bidirectional pass-count signals from the Code-UT execution matrix to iteratively prune or fix weak code candidates and refresh or replace unreliable UTs, letting the two pools co-evolve. Finally, when multiple code candidates remain tied at the highest pass count, it picks the final code from the largest output-consensus cluster, since correct programs agree on the same inputs while wrong programs diverge. Experiments on four challenging benchmarks show that CoSPlay on Qwen2.5-7B-Instruct improves average BoN from 22.1% to 33.2% and UT accuracy from 14.6% to 78.3%, matching or surpassing the RLVR model CURE-7B. When applied to CURE-7B, it further improves BoN by 5.7%. CoSPlay also generalizes across diverse backbones and outperforms GT-free TTS baselines under comparable token budgets, with continued gains as the budget scales up. These results suggest a scalable inference strategy for competitive code generation without any GT data.
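To make the pass-count and consensus steps concrete, the following is a minimal Python sketch and not the authors' implementation. It assumes that candidates are plain functions, that a unit test is an (input, expected output) pair, and that a set of probe inputs is available for the tie-break. The names `execution_matrix`, `pass_counts`, `select_by_consensus`, and `safe_run` are hypothetical, and the toy task (squaring an integer) is chosen only for illustration. The pruning, fixing, and UT refresh loop is omitted.

```python
from collections import defaultdict
from typing import Callable

# Assumption: a candidate is a plain function, a unit test is (input, expected).
Code = Callable[[int], int]
UnitTest = tuple[int, int]


def safe_run(code: Code, x: int) -> str:
    """Run one candidate on one input; crashes become a sentinel string."""
    try:
        return repr(code(x))
    except Exception:
        return "<error>"


def execution_matrix(codes: list[Code], tests: list[UnitTest]) -> list[list[bool]]:
    """matrix[i][j] is True when candidate i produces the expected output of test j."""
    return [[safe_run(code, x) == repr(expected) for x, expected in tests]
            for code in codes]


def pass_counts(matrix: list[list[bool]]) -> tuple[list[int], list[int]]:
    """Bidirectional signal: tests passed per candidate, candidates passing per test."""
    code_counts = [sum(row) for row in matrix]
    test_counts = [sum(col) for col in zip(*matrix)]
    return code_counts, test_counts


def select_by_consensus(codes: list[Code], code_counts: list[int],
                        probe_inputs: list[int]) -> int:
    """Among candidates tied at the top pass count, return the index of one
    member of the largest group that produces identical outputs on the probes."""
    best = max(code_counts)
    tied = [i for i, count in enumerate(code_counts) if count == best]
    clusters: dict[tuple[str, ...], list[int]] = defaultdict(list)
    for i in tied:
        signature = tuple(safe_run(codes[i], x) for x in probe_inputs)
        clusters[signature].append(i)
    largest = max(clusters.values(), key=len)
    return largest[0]


# Toy pools for the task "square an integer" (illustrative data only).
codes: list[Code] = [
    lambda x: x * x,   # correct
    lambda x: x ** 2,  # correct, written differently
    lambda x: x * 2,   # wrong, but agrees with the tests below
    lambda x: x + 2,   # wrong
]
tests: list[UnitTest] = [(2, 4), (0, 0)]  # weak tests that do not separate x*x from x*2

matrix = execution_matrix(codes, tests)
code_counts, test_counts = pass_counts(matrix)
chosen = select_by_consensus(codes, code_counts, probe_inputs=[2, 0, 3, -1])
```

In this toy run three candidates tie on the two weak tests, and the tie is broken in favour of the two correct candidates because they agree on every probe input while the wrong one diverges on 3 and -1. The per-test counts are the signal a fuller loop could use to decide which UTs to refresh or replace.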