EvoTrainer: Co-Evolving LLM Policies and Training Harnesses for Autonomous Agentic Reinforcement Learning

June 2, 2026
Authors: Guhong Chen, Yingcheng Shi, Yongbin Li, Binhua Li, Xander Xu, Hu Wei, Shiwen Ni, Min Yang, Jieping Ye
cs.AI

Abstract

Autonomous LLM training is often framed as recipe search, which leaves the training harness largely static. This limitation sharpens in agentic RL, where shifting bottlenecks and scalar rewards mask diverse failure modes. We introduce EvoTrainer, an autonomous training framework that co-evolves LLM policies and training-side harnesses through empirical feedback: it diagnoses rollout-level evidence, revises diagnostics, backtests interventions, and accumulates reusable skills. Evaluated on mathematical reasoning, competitive-programming code generation, and repository-level software engineering, EvoTrainer matches or exceeds the human-engineered RL references under the same data, codebase, and evaluation protocol, with the largest gain on long-horizon agentic SWE. Trajectory analyses show that retained strategies diverge across domains, evolving diagnostics prevent invalid high-scoring branches from being promoted, and reusable skills shape later search. Autonomous LLM RL should move beyond recipe search toward joint evolution of policies and the training harnesses that interpret them.