CoSPlay: Cooperative Self-Play with Self-Generated Code and Unit Tests at Test Time

Abstract

Recently, Reinforcement Learning with Verifiable Rewards (RLVR) and Test-Time Scaling (TTS) have advanced LLM code generation through executable verification. However, Ground-Truth Unit Tests (GT UTs) remain a bottleneck: state-of-the-art RLVR methods require them for costly training, while existing TTS methods lose competitiveness without them. This motivates GT-free TTS, in which existing methods directly use self-generated UTs to refine and select code candidates. Such UTs, however, are often noisy or spuriously coupled with wrong code, and UT quality in turn cannot be validated without reliable code. The key challenge is therefore to improve both jointly. To this end, we present CoSPlay, a GT-free, training-free framework that jointly improves code candidates and UTs through cooperative self-play. It first explores diverse solution ideas and identifies their potential failure modes to produce discriminative UT ideas. It then uses bidirectional pass-count signals from the Code-UT execution matrix to iteratively prune or fix weak codes and refresh or replace unreliable UTs, letting the two pools co-evolve. Finally, when multiple codes remain tied at the highest pass count, it picks the final code from the largest output-consensus cluster, since correct codes agree on the same inputs while wrong codes diverge. Experiments on four challenging benchmarks show that CoSPlay on Qwen2.5-7B-Instruct improves average BoN from 22.1% to 33.2% and UT accuracy from 14.6% to 78.3%, matching or surpassing the RLVR model CURE-7B. When applied to CURE-7B, it further improves BoN by 5.7%. CoSPlay also generalizes across diverse backbones and outperforms GT-free TTS baselines under comparable token budgets, with continued gains as the budget scales up. These results suggest a scalable inference strategy for competitive code generation without any GT data.
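
The two signals described in the abstract, bidirectional pass counts over the Code-UT execution matrix and the output-consensus tie-break, can be illustrated with a small sketch. This is not the authors' implementation: the function names, the toy candidate programs, and the probe inputs are all assumptions made for illustration, and the LLM-driven steps (idea exploration, fixing codes, refreshing UTs) are omitted entirely.

```python
from collections import defaultdict

def safe_run(code, x):
    """Run one candidate on one input; a crash becomes a hashable marker."""
    try:
        return code(x)
    except Exception as exc:
        return ("error", type(exc).__name__)

def execution_matrix(codes, tests):
    """matrix[i][j] is True when candidate i passes unit test j."""
    return [[safe_run(code, x) == expected for x, expected in tests]
            for code in codes]

def pass_counts(matrix):
    """Row sums score the codes; column sums score the unit tests."""
    code_scores = [sum(row) for row in matrix]
    test_scores = [sum(col) for col in zip(*matrix)]
    return code_scores, test_scores

def select_by_consensus(codes, code_scores, probe_inputs):
    """Among codes tied at the top pass count, pick one from the largest
    group that produces identical outputs on the probe inputs."""
    best = max(code_scores)
    tied = [i for i, score in enumerate(code_scores) if score == best]
    clusters = defaultdict(list)
    for i in tied:
        signature = tuple(safe_run(codes[i], x) for x in probe_inputs)
        clusters[signature].append(i)
    largest = max(clusters.values(), key=len)
    return largest[0]

# Toy task (hypothetical): return the absolute value of an integer.
codes = [
    abs,                             # correct
    lambda x: x,                     # wrong for negative inputs
    lambda x: -x if x < 0 else x,    # correct
    lambda x: x * x,                 # wrong for most inputs
]
# Weak self-generated tests: none of them uses a negative input.
tests = [(2, 2), (0, 0), (1, 1)]

matrix = execution_matrix(codes, tests)
code_scores, test_scores = pass_counts(matrix)
# Probe inputs are assumed to be given; they only need inputs, not labels.
chosen = select_by_consensus(codes, code_scores, probe_inputs=[-3, 5])

print(code_scores)  # [3, 3, 3, 2]
print(test_scores)  # [3, 4, 4]
print(chosen)       # 0
```

In this toy run, three candidates tie at the top pass count because the unit tests never exercise negative inputs. The consensus step breaks the tie: the two correct candidates agree on every probe input, while the wrong one diverges on -3, so the selection falls in the correct cluster. A low column sum would likewise flag a unit test that most candidates fail, which the abstract treats as a sign that the test may need to be refreshed or replaced.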