Code-A1: Adversarial Evolving of Code LLM and Test LLM via Reinforcement Learning

Abstract

Reinforcement learning for code generation relies on verifiable rewards from unit test pass rates. Yet high-quality test suites are scarce, existing datasets offer limited coverage, and static rewards fail to adapt as models improve. Recent self-play methods unify code and test generation in a single model, but face an inherent dilemma: white-box access leads to self-collusion, where the model produces trivial tests for easy rewards, while black-box restriction yields generic tests that miss implementation-specific bugs. We introduce Code-A1, an adversarial co-evolution framework that jointly optimizes a Code LLM and a Test LLM with opposing objectives. The Code LLM is rewarded for passing more tests, while the Test LLM is rewarded for exposing more defects. This architectural separation eliminates self-collusion risks and safely enables white-box test generation, where the Test LLM can inspect candidate code to craft targeted adversarial tests. We further introduce a Mistake Book mechanism for experience replay and a composite reward balancing test validity with adversarial difficulty. Experiments on Qwen2.5-Coder models demonstrate that Code-A1 achieves code generation performance matching or exceeding that of models trained on human-annotated tests, while significantly improving test generation capability.
