From Trainee to Trainer: LLM-Designed Training Environments for Multi-Agent Reasoning Reinforcement Learning

Abstract

Reinforcement learning pipelines for large language model (LLM) training often rely on environments that are manually redesigned between stages, requiring practitioners to infer heuristically which configuration will best improve the current policy. To automate this process, we propose the LLM-as-Environment-Engineer framework, in which the current policy model analyzes its failure trajectories together with contextual information and proposes the training environment configuration for the next stage. We also introduce MAPF-FrozenLake, a controllable testbed whose generator exposes multi-dimensional environment configurations, making it suitable for studying and benchmarking environment redesign. On this testbed, we condition the environment engineer on structured summaries of policy behavior, failure cases, and environment statistics, from which it produces the configuration for the next training stage. With Qwen3-4B as the backbone, our framework achieves the strongest aggregate performance on our benchmarks, outperforming larger proprietary LLMs (e.g., GPT, Gemini) and fixed-environment training baselines. We further analyze which forms of context are most effective and find that successful environment updates rely on failure evidence and preserve configurations that already work. Interestingly, the current RL checkpoint serves as a better environment engineer than the original base model, suggesting that policy learning improves the model's ability to diagnose its remaining weaknesses.
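As a rough illustration of the loop described above, the following Python sketch shows one way a structured stage summary could be turned into a next-stage configuration. It is a minimal sketch under stated assumptions, not the implementation used in this work: the configuration fields (grid_size, num_agents, hole_ratio), the failure labels, the key=value reply format, and all function names are hypothetical, and stub_engineer merely stands in for a call to the current policy checkpoint.

```python
from dataclasses import dataclass, fields, replace
from typing import Callable, Dict


@dataclass(frozen=True)
class EnvConfig:
    # Hypothetical knobs; the abstract does not list the real
    # MAPF-FrozenLake configuration dimensions.
    grid_size: int = 4
    num_agents: int = 2
    hole_ratio: float = 0.1


@dataclass
class StageSummary:
    # Structured summary of one training stage (assumed shape).
    success_rate: float
    failure_counts: Dict[str, int]
    config: EnvConfig


def build_prompt(summary: StageSummary) -> str:
    """Render the summary as text for the environment engineer."""
    lines = [f"success_rate={summary.success_rate:.2f}"]
    lines += [f"failure[{k}]={v}" for k, v in sorted(summary.failure_counts.items())]
    lines += [f"{f.name}={getattr(summary.config, f.name)}" for f in fields(EnvConfig)]
    lines.append("Propose the next configuration as key=value lines.")
    return "\n".join(lines)


def parse_proposal(text: str, current: EnvConfig) -> EnvConfig:
    """Apply key=value lines to the current config.

    Unknown keys and malformed values are ignored, so any setting the
    engineer does not mention is preserved unchanged.
    """
    casts = {f.name: f.type for f in fields(EnvConfig)}
    updates = {}
    for line in text.splitlines():
        key, sep, value = line.partition("=")
        key = key.strip()
        if sep and key in casts:
            try:
                updates[key] = casts[key](value.strip())
            except ValueError:
                continue
    return replace(current, **updates)


def next_config(summary: StageSummary, engineer: Callable[[str], str]) -> EnvConfig:
    """One redesign step: summary -> engineer reply -> next config."""
    return parse_proposal(engineer(build_prompt(summary)), summary.config)


def stub_engineer(prompt: str) -> str:
    # Stand-in for querying the current policy checkpoint (assumption).
    return "num_agents=3\nhole_ratio=0.15\nunknown_knob=1"


if __name__ == "__main__":
    summary = StageSummary(0.4, {"collision": 12, "fell_in_hole": 3}, EnvConfig())
    print(next_config(summary, stub_engineer))
```

In this sketch the parser keeps every field the engineer does not explicitly change, which loosely mirrors the observation that successful updates preserve configurations that already work.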