
TRACE: Capability-Targeted Agentic Training

April 7, 2026
Authors: Hangoo Kang, Tarun Suresh, Jon Saad-Falcon, Azalia Mirhoseini
cs.AI

Abstract

Large Language Models (LLMs) deployed in agentic environments must exercise multiple capabilities across different task instances, where a capability is performing one or more actions in a trajectory that are necessary for successfully solving a subset of tasks in the environment. Many existing approaches either rely on synthetic training data that is not targeted to the model's actual capability deficits in the target environment or train directly on the target environment, where the model needs to implicitly learn the capabilities across tasks. We introduce TRACE (Turning Recurrent Agent failures into Capability-targeted training Environments), an end-to-end system for environment-specific agent self-improvement. TRACE contrasts successful and failed trajectories to automatically identify lacking capabilities, synthesizes a targeted training environment for each that rewards whether the capability was exercised, and trains a LoRA adapter via RL on each synthetic environment, routing to the relevant adapter at inference. Empirically, TRACE generalizes across different environments, improving over the base agent by +14.1 points on τ^2-bench (customer service) and +7 perfect scores on ToolSandbox (tool use), outperforming the strongest baseline by +7.4 points and +4 perfect scores, respectively. Given the same number of rollouts, TRACE scales more efficiently than baselines, outperforming GRPO and GEPA by +9.2 and +7.4 points on τ^2-bench.
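The abstract describes a three-step loop: contrast successful and failed trajectories to find missing capabilities, synthesize per-capability environments whose reward checks whether the capability was exercised, and route to per-capability LoRA adapters at inference. The sketch below illustrates that control flow only; all function and variable names are hypothetical (they do not come from the paper), trajectories are reduced to sets of capability labels, and adapter training is replaced by a placeholder.

```python
# Hypothetical sketch of the TRACE loop from the abstract. Names are
# illustrative, not from the paper; RL training of LoRA adapters is stubbed.

def identify_missing_capabilities(successes, failures):
    """Contrast trajectories: a capability is 'missing' if successful
    trajectories exercise it but failed trajectories never do."""
    in_success = set().union(*successes) if successes else set()
    in_failure = set().union(*failures) if failures else set()
    return in_success - in_failure

def capability_reward(trajectory, capability):
    """Synthetic-environment reward: 1.0 iff the capability was exercised."""
    return 1.0 if capability in trajectory else 0.0

def route(task_capabilities, adapters):
    """At inference, select the adapters matching the task's capabilities."""
    return [adapters[c] for c in task_capabilities if c in adapters]

# Toy usage: each trajectory is the set of capability labels it exercised.
successes = [{"lookup_order", "confirm_identity"}, {"lookup_order"}]
failures = [{"confirm_identity"}]
missing = identify_missing_capabilities(successes, failures)
# Placeholder for "train a LoRA adapter via RL per synthetic environment".
adapters = {c: f"lora_{c}" for c in missing}
```

This is only a structural analogy: in the real system, capability identification and environment synthesis are automated over full execution trajectories, and each adapter is trained with RL against the synthesized reward.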
April 15, 2026