TRACE: Capability-Targeted Agentic Training

Abstract

Large Language Models (LLMs) deployed in agentic environments must exercise multiple capabilities across different task instances, where a capability means performing one or more actions in a trajectory that are necessary for successfully solving a subset of tasks in the environment. Many existing approaches either rely on synthetic training data that is not targeted at the model's actual capability deficits in the target environment, or train directly on the target environment, where the model must learn the capabilities implicitly across tasks. We introduce TRACE (Turning Recurrent Agent failures into Capability-targeted training Environments), an end-to-end system for environment-specific agent self-improvement. TRACE contrasts successful and failed trajectories to automatically identify lacking capabilities, synthesizes for each one a targeted training environment that rewards whether the capability was exercised, trains a LoRA adapter via RL on each synthetic environment, and routes to the relevant adapter at inference. Empirically, TRACE generalizes across different environments, improving over the base agent by +14.1 points on τ^2-bench (customer service) and +7 perfect scores on ToolSandbox (tool use), outperforming the strongest baseline by +7.4 points and +4 perfect scores, respectively. Given the same number of rollouts, TRACE scales more efficiently than baselines, outperforming GRPO and GEPA by +9.2 and +7.4 points, respectively, on τ^2-bench.
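
To make the contrast-then-route idea concrete, the following is a minimal illustrative sketch and not the TRACE implementation. The abstract does not specify how capabilities are extracted from trajectories, so the sketch assumes each rollout already carries hypothetical capability labels, and it swaps in a simple frequency-gap heuristic for the identification step. The names `Trajectory`, `lacking_capabilities`, `route`, the `min_gap` threshold, and the `lora::` adapter identifiers are all assumptions introduced for illustration. Environment synthesis and RL training are omitted.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Trajectory:
    task_id: str
    succeeded: bool
    exercised: frozenset  # hypothetical capability labels observed in this rollout


def lacking_capabilities(trajectories, min_gap=0.5):
    """Return capabilities that successes exercise much more often than failures.

    Assumed heuristic: gap = P(exercised | success) - P(exercised | failure).
    """
    wins = [t for t in trajectories if t.succeeded]
    losses = [t for t in trajectories if not t.succeeded]
    if not wins or not losses:
        return []  # no contrast is possible without both outcomes
    win_counts = Counter(c for t in wins for c in t.exercised)
    loss_counts = Counter(c for t in losses for c in t.exercised)
    gaps = {
        c: win_counts[c] / len(wins) - loss_counts[c] / len(losses)
        for c in set(win_counts) | set(loss_counts)
    }
    # Largest gap first; ties broken by name for determinism.
    ranked = sorted(gaps.items(), key=lambda kv: (-kv[1], kv[0]))
    return [c for c, gap in ranked if gap >= min_gap]


def route(required_capability, adapters, default="base"):
    """Pick the adapter trained for a capability, else fall back to the base model."""
    return adapters.get(required_capability, default)


# Toy rollouts with made-up capability labels.
rollouts = [
    Trajectory("t1", True, frozenset({"verify_identity", "lookup_order"})),
    Trajectory("t2", True, frozenset({"verify_identity"})),
    Trajectory("t3", False, frozenset({"lookup_order"})),
    Trajectory("t4", False, frozenset()),
]
deficits = lacking_capabilities(rollouts)
# One placeholder adapter identifier per identified deficit.
adapters = {name: f"lora::{name}" for name in deficits}
```

In this toy data, only `verify_identity` separates successes from failures, so it is the only capability that would receive a dedicated adapter; requests needing any other capability fall back to the base model.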

