CoSPlay: Cooperative Self-Play with Self-Generated Code and Unit Tests at Test Time

Abstract

Reinforcement Learning with Verifiable Rewards (RLVR) and Test-Time Scaling (TTS) have recently advanced LLM code generation through executable verification. Yet Ground-Truth Unit Tests (GT UTs) remain a bottleneck: state-of-the-art RLVR methods require them for costly training, while existing TTS methods lose competitiveness without them. This motivates GT-free TTS, where existing methods directly use self-generated UTs to refine and select code candidates. However, such UTs are often noisy or spuriously coupled with wrong code, and UT quality in turn cannot be validated without reliable code. The key challenge is therefore to improve code and UTs jointly. To this end, we present CoSPlay, a GT-free, training-free framework that jointly improves codes and UTs through cooperative self-play. It first explores diverse solution ideas and identifies their potential failure modes to produce discriminative UT ideas. It then uses bidirectional pass-count signals from the Code-UT execution matrix to iteratively prune or fix weak codes and refresh or replace unreliable UTs, letting the two pools co-evolve. Finally, when multiple codes remain tied at the highest pass count, it picks the final code from the largest output-consensus cluster, since correct codes agree on the same inputs while wrong codes diverge. Experiments on four challenging benchmarks show that CoSPlay on Qwen2.5-7B-Instruct improves average Best-of-N (BoN) accuracy from 22.1% to 33.2% and UT accuracy from 14.6% to 78.3%, matching or surpassing the RLVR model CURE-7B. When applied to CURE-7B, it further improves BoN by 5.7%. CoSPlay also generalizes across diverse backbones and outperforms GT-free TTS baselines under comparable token budgets, with continued gains as the budget scales up. These results suggest a scalable inference strategy for competitive code generation without any GT data.
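
To make the pass-count and consensus steps concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes candidate codes are plain Python callables and unit tests are (input, expected output) pairs. All names (`execution_matrix`, `pass_counts`, `select_by_consensus`, `safe_run`) and the toy "square an integer" task are hypothetical, and the iterative repair and refresh loop is omitted because the abstract does not specify its thresholds.

```python
from collections import defaultdict
from typing import Callable, Hashable, Sequence

# Hypothetical representations, chosen only for this illustration.
Code = Callable[[int], Hashable]      # a candidate program
UnitTest = tuple[int, Hashable]       # (input, expected output)


def safe_run(code: Code, x: int) -> Hashable:
    """Run a candidate; map any exception to a hashable error marker."""
    try:
        return code(x)
    except Exception as exc:
        return ("error", type(exc).__name__)


def execution_matrix(codes: Sequence[Code], tests: Sequence[UnitTest]) -> list[list[bool]]:
    """matrix[i][j] is True when code i produces the expected output of test j."""
    return [[safe_run(code, x) == expected for x, expected in tests] for code in codes]


def pass_counts(matrix: list[list[bool]]) -> tuple[list[int], list[int]]:
    """Bidirectional signals: tests passed per code, and codes passing per test."""
    per_code = [sum(row) for row in matrix]
    per_test = [sum(col) for col in zip(*matrix)]
    return per_code, per_test


def select_by_consensus(codes: Sequence[Code], per_code: Sequence[int],
                        probe_inputs: Sequence[int]) -> int:
    """Among codes tied at the top pass count, return the index of one member
    of the largest cluster of codes whose outputs agree on all probe inputs."""
    best = max(per_code)
    tied = [i for i, count in enumerate(per_code) if count == best]
    clusters: dict[tuple, list[int]] = defaultdict(list)
    for i in tied:
        signature = tuple(safe_run(codes[i], x) for x in probe_inputs)
        clusters[signature].append(i)
    return max(clusters.values(), key=len)[0]


# Toy task: square an integer. Two correct candidates, two wrong ones.
codes = [lambda x: x * x, lambda x: x ** 2, lambda x: x + x, lambda x: abs(x)]
# Self-generated tests; the last one is deliberately wrong (noisy).
tests = [(0, 0), (2, 4), (3, 10)]

matrix = execution_matrix(codes, tests)
per_code, per_test = pass_counts(matrix)
chosen = select_by_consensus(codes, per_code, probe_inputs=[3, -1, 5])
print(per_code, per_test, chosen)
```

In this toy run, three candidates tie at the top pass count, including the wrong `x + x`, which happens to agree with the weak tests. The two correct candidates produce identical outputs on the probe inputs and form the largest cluster, so one of them is selected. The test that no candidate passes is the kind of low-pass-count signal that could flag a UT for refresh or replacement, although how CoSPlay acts on such signals is not detailed in the abstract.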