Code-A1: Adversarial Evolving of Code LLM and Test LLM via Reinforcement Learning
March 16, 2026
Authors: Aozhe Wang, Yuchen Yan, Nan Zhou, Zhengxi Lu, Weiming Lu, Jun Xiao, Yueting Zhuang, Yongliang Shen
cs.AI
Abstract
Reinforcement learning for code generation relies on verifiable rewards from unit-test pass rates. Yet high-quality test suites are scarce, existing datasets offer limited coverage, and static rewards fail to adapt as models improve. Recent self-play methods unify code and test generation in a single model, but face an inherent dilemma: white-box access leads to self-collusion, where the model produces trivial tests for easy rewards, while black-box restriction yields generic tests that miss implementation-specific bugs. We introduce Code-A1, an adversarial co-evolution framework that jointly optimizes a Code LLM and a Test LLM with opposing objectives: the Code LLM is rewarded for passing more tests, while the Test LLM is rewarded for exposing more defects. This architectural separation eliminates self-collusion and safely enables white-box test generation, in which the Test LLM inspects candidate code to craft targeted adversarial tests. We further introduce a Mistake Book mechanism for experience replay and a composite reward that balances test validity with adversarial difficulty. Experiments on Qwen2.5-Coder models demonstrate that Code-A1 matches or exceeds the code generation performance of models trained on human-annotated tests, while significantly improving test generation capability.
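The opposing reward structure described in the abstract can be sketched minimally. This is an illustrative assumption, not the paper's actual composite reward: the function name, the validity check, and the zero-sum shaping below are all hypothetical, meant only to show why architectural separation removes the incentive for trivial tests.

```python
# Hypothetical sketch of opposing rewards for a Code LLM and a Test LLM.
# A test is "valid" if a trusted reference solution passes it; only valid
# tests carry reward, so the Test LLM gains nothing from trivial or broken
# tests, and gains most from valid tests that the candidate code fails.

def adversarial_rewards(code_passes: list[bool], test_valid: list[bool]):
    """For each generated test, `code_passes[i]` says whether the candidate
    code passed it and `test_valid[i]` whether the reference solution did.
    Returns (code_reward, test_reward), each in [0, 1]."""
    # Keep only outcomes on valid tests.
    valid_outcomes = [p for p, v in zip(code_passes, test_valid) if v]
    if not valid_outcomes:
        return 0.0, 0.0  # no valid tests: neither side is rewarded

    # Code LLM: fraction of valid tests passed.
    code_reward = sum(valid_outcomes) / len(valid_outcomes)
    # Test LLM: fraction of valid tests that exposed a defect.
    test_reward = 1.0 - code_reward
    return code_reward, test_reward
```

For example, with three tests where the candidate passes the first and third but the third test is itself invalid, only two tests count: `adversarial_rewards([True, False, True], [True, True, False])` yields `(0.5, 0.5)`. Validity-gating is what distinguishes this from plain self-play: the Test LLM cannot win by emitting tests that nothing could pass.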