CoSPlay: Test-Time Cooperative Self-Play with Self-Generated Code and Unit Tests

Abstract

Recently, Reinforcement Learning with Verifiable Rewards (RLVR) and Test-Time Scaling (TTS) have advanced LLM code generation through executable verification. Yet Ground-Truth Unit Tests (GT UTs) remain a bottleneck: SOTA RLVR methods require them for costly training, while existing TTS methods lose competitiveness without them. This motivates GT-free TTS, where existing methods directly use self-generated UTs to refine and select code candidates. Yet such UTs are often noisy or spuriously coupled with wrong code, and UT quality in turn cannot be validated without reliable code. The key challenge is therefore to jointly improve both. To this end, we present CoSPlay, a GT-free, training-free framework that jointly improves codes and UTs through cooperative self-play. It first explores diverse solution ideas and identifies their potential failure modes to produce discriminative UT ideas. It then uses bidirectional pass-count signals from the Code-UT execution matrix to iteratively prune or fix weak codes and refresh or replace unreliable UTs, letting the two pools co-evolve. Finally, when multiple codes remain tied at the highest pass count, it picks the final code from the largest output-consensus cluster, since correct codes agree on the same inputs while wrong codes diverge. Experiments on four challenging benchmarks show that CoSPlay on Qwen2.5-7B-Instruct improves average BoN from 22.1% to 33.2% and UT accuracy from 14.6% to 78.3%, matching or surpassing the RLVR model CURE-7B. When applied to CURE-7B, it further improves BoN by 5.7%. CoSPlay also generalizes across diverse backbones and outperforms GT-free TTS baselines under comparable token budgets, with continued gains as the budget scales up. These results suggest a scalable inference strategy for competitive code generation without any GT data.
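The pass-count signals and the consensus tie-break described above can be illustrated with a minimal sketch. This is not the authors' implementation: candidate codes are modeled as plain Python callables, unit tests as hypothetical `(input, expected_output)` pairs, and the names `execution_matrix`, `pass_counts`, `select_by_consensus`, and `probe_inputs` are assumptions made for illustration. The iterative prune/fix and refresh/replace steps, which require an LLM, are omitted.

```python
from collections import defaultdict
from typing import Callable, Hashable, List, Sequence, Tuple

# Hypothetical stand-ins: a "code" is a callable, a "unit test" is (input, expected).
Code = Callable[[int], Hashable]
UnitTest = Tuple[int, Hashable]


def safe_run(code: Code, x: int) -> Hashable:
    """Run a candidate on one input; a crash becomes a sentinel output."""
    try:
        return code(x)
    except Exception:
        return "<error>"


def execution_matrix(codes: Sequence[Code], tests: Sequence[UnitTest]) -> List[List[bool]]:
    """matrix[i][j] is True when code i passes unit test j."""
    return [[safe_run(code, x) == expected for x, expected in tests] for code in codes]


def pass_counts(matrix: List[List[bool]]) -> Tuple[List[int], List[int]]:
    """Bidirectional signals: tests passed per code, and codes passing per test."""
    code_counts = [sum(row) for row in matrix]
    test_counts = [sum(col) for col in zip(*matrix)]
    return code_counts, test_counts


def select_by_consensus(
    codes: Sequence[Code], tests: Sequence[UnitTest], probe_inputs: Sequence[int]
) -> int:
    """Among codes tied at the top pass count, return the index of one member
    of the largest cluster of codes that agree on every probe input."""
    code_counts, _ = pass_counts(execution_matrix(codes, tests))
    best = max(code_counts)
    tied = [i for i, count in enumerate(code_counts) if count == best]

    clusters = defaultdict(list)
    for i in tied:
        signature = tuple(safe_run(codes[i], x) for x in probe_inputs)
        clusters[signature].append(i)
    return max(clusters.values(), key=len)[0]


# Toy task (assumed): return the square of x.
codes = [
    lambda x: x * x,       # correct
    lambda x: x ** 2,      # correct
    lambda x: x + x,       # wrong, but agrees with x*x on 0 and 2
    lambda x: abs(x) * 2,  # wrong, but agrees with x*x on 0 and 2
]
tests = [(0, 0), (2, 4)]   # weak self-generated tests that every candidate passes

code_counts, test_counts = pass_counts(execution_matrix(codes, tests))
chosen = select_by_consensus(codes, tests, probe_inputs=[0, 2, 3, -3])
```

In this toy setup every candidate passes both weak tests, so pass counts alone cannot separate them. On the probe inputs the two correct candidates produce identical outputs while the two wrong ones each diverge, so the largest consensus cluster contains only correct codes. In a full pipeline, a test that almost no code passes (a low column count) would be a candidate for refresh or replacement, and a code with a low row count would be a candidate for pruning or repair.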