From Trainee to Trainer: LLM-Designed Reinforcement Learning Training Environments for Multi-Agent Reasoning

Abstract

Reinforcement learning pipelines for Large Language Model (LLM) training often rely on environments that are manually redesigned between stages, requiring practitioners to infer heuristically which configuration will best improve the current policy. To automate this process, we propose the LLM-as-Environment-Engineer framework, in which the current policy model analyzes failure trajectories together with contextual information and proposes modifications to the next-stage training environment configuration. We also introduce MAPF-FrozenLake, a controllable testbed whose generator exposes multi-dimensional environment configurations, making it suitable for studying and benchmarking environment redesign. On this testbed, we condition the environment engineer on structured summaries of policy behavior, failure cases, and environment statistics, from which it produces the configuration for the next training stage. With Qwen3-4B as the backbone, our framework achieves the strongest aggregate performance on our benchmarks, outperforming larger proprietary LLMs (e.g., GPT, Gemini) and fixed-environment training baselines. We further analyze which forms of context are most effective, finding that successful environment updates rely on failure evidence and preserve configurations that already work. Interestingly, the current RL checkpoint serves as a better environment engineer than the original base model, suggesting that policy learning improves the model's ability to diagnose its own remaining weaknesses.
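
The stage-by-stage redesign loop described above can be pictured with a minimal Python sketch. Everything named here is a hypothetical stand-in rather than the paper's implementation: the `EnvConfig` fields, the `StageReport` structure, and the `build_context` prompt format are assumptions, and both the training stage and the engineer are replaced by deterministic stubs so that the control flow is runnable without an LLM.

```python
from dataclasses import dataclass, replace
from typing import Callable


@dataclass(frozen=True)
class EnvConfig:
    # Hypothetical configuration dimensions; the abstract does not list the real ones.
    grid_size: int = 4
    num_agents: int = 2
    hole_density: float = 0.1


@dataclass
class StageReport:
    # Assumed shape of the evidence gathered after one training stage.
    success_rate: float
    failure_cases: list[str]
    env_stats: dict[str, float]


def build_context(config: EnvConfig, report: StageReport) -> str:
    """Assemble the text an LLM engineer would be conditioned on (assumed format)."""
    failures = "\n".join(f"- {case}" for case in report.failure_cases) or "- none"
    stats = ", ".join(f"{k}={v}" for k, v in sorted(report.env_stats.items()))
    return (
        f"Current config: {config}\n"
        f"Success rate: {report.success_rate:.2f}\n"
        f"Failure cases:\n{failures}\n"
        f"Environment statistics: {stats}"
    )


TrainStage = Callable[[EnvConfig], StageReport]
Engineer = Callable[[EnvConfig, StageReport], EnvConfig]


def redesign_loop(
    config: EnvConfig, train_stage: TrainStage, engineer: Engineer, num_stages: int
) -> list[EnvConfig]:
    """Alternate between training on a config and asking the engineer for the next one."""
    history = [config]
    for _ in range(num_stages):
        report = train_stage(config)       # train the policy, collect evidence
        config = engineer(config, report)  # propose the next-stage configuration
        history.append(config)
    return history


def stub_train_stage(config: EnvConfig) -> StageReport:
    # Stub: pretend the policy starts failing once the grid reaches size 6.
    if config.grid_size >= 6:
        return StageReport(0.5, ["agent 0 fell into a hole"], {"grid_size": config.grid_size})
    return StageReport(1.0, [], {"grid_size": config.grid_size})


def stub_engineer(config: EnvConfig, report: StageReport) -> EnvConfig:
    # Stub: a real engineer would send build_context(config, report) to the
    # current policy LLM and parse its reply. This placeholder rule enlarges the
    # grid only when no failures remain and otherwise keeps the config unchanged.
    if not report.failure_cases:
        return replace(config, grid_size=config.grid_size + 1)
    return config
```

Calling `redesign_loop(EnvConfig(), stub_train_stage, stub_engineer, num_stages=4)` returns the sequence of configurations used across stages. Passing the engineer in as a callable reflects the framework's central idea that the current policy checkpoint, a base model, or a proprietary LLM can each fill that role.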