TRACE: Capability-Targeted Agentic Training

Abstract

Large Language Models (LLMs) deployed in agentic environments must exercise multiple capabilities across task instances, where a capability is the ability to perform one or more actions in a trajectory that are necessary for solving a subset of the environment's tasks. Many existing approaches either rely on synthetic training data that does not target the model's actual capability deficits in the target environment, or train directly on the target environment, where the model must learn the capabilities implicitly across tasks. We introduce TRACE (Turning Recurrent Agent failures into Capability-targeted training Environments), an end-to-end system for environment-specific agent self-improvement. TRACE contrasts successful and failed trajectories to automatically identify the capabilities the agent lacks, synthesizes a targeted training environment for each one that rewards whether the capability was exercised, trains a LoRA adapter via RL on each synthetic environment, and routes to the relevant adapter at inference. Empirically, TRACE generalizes across environments: it improves over the base agent by +14.1 points on τ²-bench (customer service) and +7 perfect scores on ToolSandbox (tool use), outperforming the strongest baseline by +7.4 points and +4 perfect scores, respectively. Given the same number of rollouts, TRACE also scales more efficiently than the baselines, outperforming GRPO and GEPA by +9.2 and +7.4 points, respectively, on τ²-bench.
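
To make the contrast-and-route idea concrete, the following is a minimal illustrative sketch, not the TRACE implementation. It assumes each trajectory already carries a set of capability labels (the abstract says TRACE identifies these automatically, and does not say how). The names `Trajectory`, `lacking_capabilities`, `route`, the `min_gap` threshold, and the adapter names are all hypothetical. A capability is flagged as lacking when it appears much more often in successful trajectories than in failed ones, and routing is reduced to a dictionary lookup.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Trajectory:
    task_id: str
    capabilities_exercised: frozenset  # hypothetical, pre-annotated labels
    success: bool


def lacking_capabilities(trajectories, min_gap=0.3):
    """Return capabilities seen far more often in successes than in failures.

    Assumption: a large gap in usage frequency between successful and
    failed trajectories is a proxy for a capability the agent lacks.
    """
    successes = [t for t in trajectories if t.success]
    failures = [t for t in trajectories if not t.success]
    if not successes or not failures:
        return []  # nothing to contrast

    all_caps = set().union(*(t.capabilities_exercised for t in trajectories))
    gaps = {}
    for cap in all_caps:
        p_succ = sum(cap in t.capabilities_exercised for t in successes) / len(successes)
        p_fail = sum(cap in t.capabilities_exercised for t in failures) / len(failures)
        if p_succ - p_fail >= min_gap:
            gaps[cap] = p_succ - p_fail
    # Largest gap first; ties broken alphabetically for determinism.
    return sorted(gaps, key=lambda c: (-gaps[c], c))


def route(capability, adapters, default="base"):
    """Pick the adapter trained for a capability, else fall back to the base model."""
    return adapters.get(capability, default)


trajectories = [
    Trajectory("t1", frozenset({"verify_identity", "lookup_order"}), True),
    Trajectory("t2", frozenset({"verify_identity", "lookup_order"}), True),
    Trajectory("t3", frozenset({"lookup_order"}), False),
    Trajectory("t4", frozenset({"lookup_order"}), False),
]
targets = lacking_capabilities(trajectories)
adapters = {cap: f"lora_{cap}" for cap in targets}  # one adapter per capability
```

In this toy data, `lookup_order` appears equally in successes and failures, so only `verify_identity` is selected as a training target. In the system the abstract describes, each selected capability would then get its own synthesized environment and RL-trained LoRA adapter; that step is outside the scope of this sketch.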