From Trainee to Trainer: LLM-Designed Training Environment for RL with Multi-Agent Reasoning
June 16, 2026
Authors: Chao Chen, Chengzu Li, Zhiwei Li, Yinhong Liu, Zhijiang Guo
cs.AI
Abstract
Reinforcement learning pipelines for Large Language Model (LLM) training often rely on manually redesigned environments between stages, requiring practitioners to heuristically infer which configuration will best improve the current policy. To automate this process, we propose the LLM-as-Environment-Engineer framework, in which the current policy model analyzes failure trajectories together with contextual information and proposes modifications to the next-stage training environment configuration. We also introduce MAPF-FrozenLake, a controllable testbed whose generator exposes multi-dimensional environment configurations, making it suitable for studying and benchmarking environment redesign. On this testbed, we condition the environment engineer on structured summaries of policy behavior, failure cases, and environment statistics, from which it produces the configuration for the next training stage. With Qwen3-4B as the backbone, our framework achieves the strongest aggregate performance on our benchmarks, outperforming larger proprietary LLMs (e.g., GPT, Gemini) and fixed-environment training baselines. We further analyze which forms of context are most effective, finding that successful environment updates rely on failure evidence and preserve configurations that already work. Interestingly, the current RL checkpoint serves as a better environment engineer than the original base model, suggesting that policy learning improves the model's ability to diagnose its remaining weaknesses.
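The staged loop described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the configuration fields (`grid_size`, `num_agents`, `hole_ratio`), the `StageSummary` layout, and the helper names are all hypothetical, since the abstract does not list the actual dimensions exposed by the MAPF-FrozenLake generator. The `train_and_evaluate` and `propose` callables are stand-ins for RL training plus evaluation and for querying the current policy checkpoint as the environment engineer; parsing the model's text output into a dictionary is omitted.

```python
from dataclasses import dataclass, fields, replace
from typing import Callable


@dataclass(frozen=True)
class EnvConfig:
    # Hypothetical configuration dimensions, chosen for illustration only.
    grid_size: int = 4
    num_agents: int = 2
    hole_ratio: float = 0.1


@dataclass
class StageSummary:
    # Structured context for the engineer: behavior, failures, statistics.
    success_rate: float
    failure_cases: list[str]
    env_stats: dict[str, float]


def build_engineer_prompt(config: EnvConfig, summary: StageSummary) -> str:
    """Render the current config and stage summary as text for the engineer."""
    lines = [
        f"Current config: {config}",
        f"Success rate: {summary.success_rate:.2f}",
        "Failure cases:",
        *[f"- {case}" for case in summary.failure_cases],
        f"Environment statistics: {summary.env_stats}",
        "Propose changes to the config for the next training stage.",
    ]
    return "\n".join(lines)


def apply_proposal(config: EnvConfig, proposal: dict) -> EnvConfig:
    """Apply only recognized fields; every other setting is preserved."""
    known = {f.name for f in fields(config)}
    changes = {k: v for k, v in proposal.items() if k in known}
    return replace(config, **changes)


def run_stages(
    config: EnvConfig,
    train_and_evaluate: Callable[[EnvConfig], StageSummary],
    propose: Callable[[str], dict],
    num_stages: int,
) -> list[EnvConfig]:
    """Alternate between training on a config and redesigning it."""
    history = [config]
    for _ in range(num_stages):
        summary = train_and_evaluate(config)  # stand-in for RL training + eval
        proposal = propose(build_engineer_prompt(config, summary))
        config = apply_proposal(config, proposal)
        history.append(config)
    return history


if __name__ == "__main__":
    # Stub callables so the sketch runs without a model or a simulator.
    def stub_train(config: EnvConfig) -> StageSummary:
        return StageSummary(0.5, ["agents collide in a corridor"], {"mean_path_length": 7.0})

    def stub_propose(prompt: str) -> dict:
        return {"num_agents": 3}

    for stage, cfg in enumerate(run_stages(EnvConfig(), stub_train, stub_propose, 2)):
        print(stage, cfg)
```

In this sketch, `apply_proposal` changes only the fields named in the proposal, which loosely mirrors the finding that successful updates preserve configurations that already work. Passing the same checkpoint that was just trained as `propose` would correspond to the setting where the current RL checkpoint acts as the engineer.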