Code-A1: Adversarial Evolving of Code LLM and Test LLM via Reinforcement Learning
March 16, 2026
Authors: Aozhe Wang, Yuchen Yan, Nan Zhou, Zhengxi Lu, Weiming Lu, Jun Xiao, Yueting Zhuang, Yongliang Shen
cs.AI
Abstract
Reinforcement learning for code generation relies on verifiable rewards from unit test pass rates. Yet high-quality test suites are scarce, existing datasets offer limited coverage, and static rewards fail to adapt as models improve. Recent self-play methods unify code and test generation in a single model, but face an inherent dilemma: white-box access leads to self-collusion, where the model produces trivial tests for easy rewards, while black-box restriction yields generic tests that miss implementation-specific bugs. We introduce Code-A1, an adversarial co-evolution framework that jointly optimizes a Code LLM and a Test LLM with opposing objectives. The Code LLM is rewarded for passing more tests, while the Test LLM is rewarded for exposing more defects. This architectural separation eliminates self-collusion risks and safely enables white-box test generation, where the Test LLM can inspect candidate code to craft targeted adversarial tests. We further introduce a Mistake Book mechanism for experience replay and a composite reward balancing test validity with adversarial difficulty. Experiments on Qwen2.5-Coder models demonstrate that Code-A1 achieves code generation performance matching or exceeding models trained on human-annotated tests, while significantly improving test generation capability.