From Trainee to Trainer: LLM-Designed Training Environments for Reinforcement Learning with Multi-Agent Reasoning

Abstract

Reinforcement learning pipelines for Large Language Model (LLM) training often rely on environments that are manually redesigned between stages, requiring practitioners to heuristically infer which configuration will best improve the current policy. To automate this process, we propose the LLM-as-Environment-Engineer framework, in which the current policy model analyzes failure trajectories together with contextual information and proposes modifications to the next-stage training environment configuration. We also introduce MAPF-FrozenLake, a controllable testbed whose generator exposes multi-dimensional environment configurations, making it suitable for studying and benchmarking environment redesign. On this testbed, we condition the environment engineer on structured summaries of policy behavior, failure cases, and environment statistics, from which it produces the configuration for the next training stage. With Qwen3-4B as the backbone, our framework achieves the strongest aggregate performance on our benchmarks, outperforming larger proprietary LLMs (e.g., GPT, Gemini) and fixed-environment training baselines. We further analyze which forms of context are most effective, finding that successful environment updates rely on failure evidence and preserve configurations that already work. Interestingly, the current RL checkpoint serves as a better environment engineer than the original base model, suggesting that policy learning improves the model's ability to diagnose its remaining weaknesses.
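The staged loop described above (train, summarize failures, propose the next configuration) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `EnvConfig`, `StageSummary`, `run_stages`, `toy_train_and_evaluate`, and `toy_engineer` are hypothetical, the configuration fields are invented stand-ins for the "multi-dimensional environment configurations" the abstract mentions, and the rule-based engineer is a placeholder for the LLM call.

```python
from dataclasses import dataclass, replace
from typing import Callable, Dict, List, Tuple


@dataclass(frozen=True)
class EnvConfig:
    # Hypothetical knobs; the abstract does not list the actual dimensions.
    grid_size: int = 4
    num_agents: int = 2
    hole_density: float = 0.3


@dataclass
class StageSummary:
    # Structured context handed to the environment engineer.
    success_rate: float
    failure_cases: List[str]
    env_stats: Dict[str, float]


def run_stages(
    config: EnvConfig,
    train_and_evaluate: Callable[[EnvConfig], StageSummary],
    propose_config: Callable[[EnvConfig, StageSummary], EnvConfig],
    num_stages: int,
) -> List[Tuple[EnvConfig, StageSummary]]:
    """Alternate between training on a config and redesigning it."""
    history = []
    for _ in range(num_stages):
        summary = train_and_evaluate(config)
        history.append((config, summary))
        # In the framework this step would prompt the current policy model.
        config = propose_config(config, summary)
    return history


def toy_train_and_evaluate(config: EnvConfig) -> StageSummary:
    # Fabricated stand-in for RL training: denser, larger maps "fail" more.
    difficulty = config.grid_size * config.hole_density * config.num_agents
    success = max(0.0, 1.0 - difficulty / 4.0)
    failures = ["fell_in_hole"] if success < 0.5 else []
    return StageSummary(success, failures, {"difficulty": difficulty})


def toy_engineer(config: EnvConfig, summary: StageSummary) -> EnvConfig:
    # Arbitrary placeholder rule, chosen only to make the loop observable:
    # change the dimension implicated by failures, leave the others as-is.
    if "fell_in_hole" in summary.failure_cases:
        return replace(config, hole_density=round(config.hole_density / 2, 4))
    return replace(config, grid_size=config.grid_size + 1)


if __name__ == "__main__":
    for cfg, summ in run_stages(EnvConfig(), toy_train_and_evaluate, toy_engineer, 3):
        print(cfg, round(summ.success_rate, 2), summ.failure_cases)
```

Passing the engineer in as a callable keeps the loop independent of how the next configuration is produced, so a fixed-environment baseline would correspond to a `propose_config` that returns its input unchanged.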