EvoTrainer: Co-Evolving LLM Policies and Training Harnesses for Autonomous Agentic Reinforcement Learning

Abstract

Autonomous LLM training is often framed as recipe search, which leaves the training harness largely static. This limitation is especially acute in agentic RL, where shifting bottlenecks and sparse scalar rewards mask diverse failure modes. We introduce EvoTrainer, an autonomous training framework that co-evolves LLM policies and training-side harnesses through empirical feedback: it diagnoses rollout-level evidence, revises diagnostics, backtests interventions, and accumulates reusable skills. Evaluated on mathematical reasoning, competitive-programming code generation, and repository-level software engineering (SWE), EvoTrainer matches or exceeds human-engineered RL references under the same data, codebase, and evaluation protocol, with the largest gain on long-horizon agentic SWE. Trajectory analyses show that retained strategies diverge across domains, that evolving diagnostics prevent invalid high-scoring branches from being promoted, and that reusable skills shape later search. Autonomous LLM RL should move beyond recipe search toward joint evolution of policies and the training harnesses that interpret them.