Don't Just Fine-tune the Agent, Tune the Environment

Abstract

Large Language Model (LLM) agents show great promise for complex, multi-turn tool-use tasks, but their development is often hampered by the extreme scarcity of high-quality training data. Supervised fine-tuning (SFT) on synthetic data leads to overfitting, whereas standard reinforcement learning (RL) struggles with a critical cold-start problem and training instability. To address these challenges, we introduce Environment Tuning, a novel training paradigm that enables agents to learn complex behaviors directly from problem instances without relying on pre-collected expert trajectories. Environment Tuning orchestrates this learning process through a structured curriculum, actionable environment augmentation that provides corrective feedback, and fine-grained progress rewards that ensure stable and efficient exploration. Using only 400 problem instances from the Berkeley Function-Calling Leaderboard (BFCL) benchmark, our method not only achieves competitive in-distribution performance against strong baselines but also demonstrates superior out-of-distribution generalization, overcoming the performance collapse common to SFT-based approaches. Our work presents a paradigm shift from supervised fine-tuning on static trajectories to dynamic, environment-based exploration, paving the way for training more robust and data-efficient agents.
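
To illustrate why a fine-grained progress reward can ease the cold-start problem mentioned above, the following minimal sketch contrasts a sparse all-or-nothing reward with a dense per-turn one. The abstract does not define the reward, so everything here is an assumption made for illustration: the function names `binary_reward` and `progress_reward` are hypothetical, and "fraction of turns completed correctly" is only one plausible definition, not necessarily the one used in the paper.

```python
from typing import Sequence


def binary_reward(turn_ok: Sequence[bool]) -> float:
    """Sparse signal: 1.0 only if the episode is non-empty and every turn succeeded."""
    return 1.0 if len(turn_ok) > 0 and all(turn_ok) else 0.0


def progress_reward(turn_ok: Sequence[bool]) -> float:
    """Dense signal (hypothetical definition): fraction of turns that succeeded."""
    if len(turn_ok) == 0:
        return 0.0
    return sum(turn_ok) / len(turn_ok)


# A four-turn episode in which the agent fails only the third turn.
episode = [True, True, False, True]
print(binary_reward(episode))    # 0.0: the partial success is invisible
print(progress_reward(episode))  # 0.75: the partial success is credited
```

Under the sparse reward, an early policy that rarely completes every turn receives almost no learning signal; the dense variant still distinguishes a nearly correct episode from a completely failed one.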
