EvoTrainer: Autonomous Agentic Reinforcement Learning through Co-Evolution of LLM Policies and Training Harnesses

Abstract

Autonomous LLM training is often framed as recipe search, which leaves the training harness largely static. This limitation is most acute in agentic reinforcement learning (RL), where bottlenecks shift during training and scalar rewards mask diverse failure modes. We introduce EvoTrainer, an autonomous training framework that co-evolves LLM policies and training-side harnesses through empirical feedback: it diagnoses rollout-level evidence, revises its diagnostics, backtests interventions, and accumulates reusable skills. Evaluated on mathematical reasoning, competitive-programming code generation, and repository-level software engineering (SWE), EvoTrainer matches or exceeds the human-engineered RL references under the same data, codebase, and evaluation protocol, with the largest gain on long-horizon agentic SWE. Trajectory analyses show that retained strategies diverge across domains, that evolving diagnostics prevent invalid high-scoring branches from being promoted, and that reusable skills shape later search. Autonomous LLM RL should move beyond recipe search toward joint evolution of policies and the training harnesses that interpret them.
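
As a purely illustrative sketch of the loop summarized above (diagnose rollout-level evidence, gate candidates with diagnostics, backtest against the incumbent, record what worked), the toy Python below shows how a diagnostic can keep a high-scoring but invalid branch from being promoted. Every name in it (Rollout, Branch, skipped_tests, select_branch, the ran_tests field, and the intervention labels) is a hypothetical stand-in chosen for this example; the abstract does not specify EvoTrainer's actual interfaces, and this is not its implementation.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable, List, Optional


@dataclass
class Rollout:
    reward: float    # scalar reward the trainer optimizes
    ran_tests: bool  # hypothetical rollout-level evidence the scalar hides


@dataclass
class Branch:
    intervention: str        # hypothetical label for a training-side change
    rollouts: List[Rollout]


# A diagnostic inspects rollouts and returns a failure label, or None if clean.
Diagnostic = Callable[[List[Rollout]], Optional[str]]


def skipped_tests(rollouts: List[Rollout]) -> Optional[str]:
    """Flag a branch when most rollouts earn high reward without running tests."""
    suspicious = [r for r in rollouts if r.reward > 0.5 and not r.ran_tests]
    return "reward_without_tests" if len(suspicious) > len(rollouts) / 2 else None


def select_branch(incumbent: Branch, candidates: List[Branch],
                  diagnostics: List[Diagnostic], skills: List[str]) -> Branch:
    """Backtest candidates against the incumbent, skipping flagged branches."""
    best = incumbent
    for cand in candidates:
        labels = [d(cand.rollouts) for d in diagnostics]
        if any(label is not None for label in labels):
            continue  # invalid branch: never promoted, whatever its score
        if mean(r.reward for r in cand.rollouts) > mean(r.reward for r in best.rollouts):
            best = cand
    if best is not incumbent:
        skills.append(best.intervention)  # remember what worked for later search
    return best


# Toy data: one candidate scores highest but never runs the tests.
incumbent = Branch("baseline", [Rollout(0.4, True), Rollout(0.4, True)])
hacked = Branch("reward_shaping_v2", [Rollout(0.9, False), Rollout(0.9, False)])
honest = Branch("longer_horizon", [Rollout(0.6, True), Rollout(0.6, True)])

naive_pick = select_branch(incumbent, [hacked, honest], [], [])
skills: List[str] = []
gated_pick = select_branch(incumbent, [hacked, honest], [skipped_tests], skills)
```

In this toy setup, selection on scalar reward alone picks the branch that never ran its tests, while adding the single diagnostic promotes the lower-scoring but valid branch and records its intervention as a reusable skill. Revising the diagnostics corresponds here to nothing more than appending further functions to the diagnostics list.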