EvoTrainer: Co-Evolving LLM Policies and Training Harnesses for Autonomous Agentic Reinforcement Learning

Abstract

Autonomous LLM training is often framed as recipe search, which leaves the training harness largely static. This limitation is sharper in agentic reinforcement learning (RL), where bottlenecks shift over the course of training and scalar rewards mask diverse failure modes. We introduce EvoTrainer, an autonomous training framework that co-evolves LLM policies and training-side harnesses through empirical feedback: it diagnoses rollout-level evidence, revises its diagnostics, backtests interventions, and accumulates reusable skills. Evaluated on mathematical reasoning, competitive-programming code generation, and repository-level software engineering (SWE), EvoTrainer matches or exceeds the human-engineered RL references under the same data, codebase, and evaluation protocol, with the largest gain on long-horizon agentic SWE. Trajectory analyses show that retained strategies diverge across domains, that evolving diagnostics prevent invalid high-scoring branches from being promoted, and that reusable skills shape later search. Autonomous LLM RL should move beyond recipe search toward joint evolution of policies and the training harnesses that interpret them.