TRACE: Capability-Targeted Agentic Training

Abstract

Large Language Models (LLMs) deployed in agentic environments must exercise multiple capabilities across different task instances, where a capability means performing one or more actions in a trajectory that are necessary to solve a subset of the environment's tasks. Many existing approaches either rely on synthetic training data that is not targeted at the model's actual capability deficits in the target environment, or train directly on the target environment, where the model must learn the capabilities implicitly across tasks. We introduce TRACE (Turning Recurrent Agent failures into Capability-targeted training Environments), an end-to-end system for environment-specific agent self-improvement. TRACE contrasts successful and failed trajectories to automatically identify lacking capabilities, synthesizes a targeted training environment for each capability that rewards whether the capability was exercised, trains a LoRA adapter via RL on each synthetic environment, and routes to the relevant adapter at inference. Empirically, TRACE generalizes across different environments, improving over the base agent by +14.1 points on τ²-bench (customer service) and +7 perfect scores on ToolSandbox (tool use), and outperforming the strongest baseline by +7.4 points and +4 perfect scores, respectively. Given the same number of rollouts, TRACE scales more efficiently than the baselines, outperforming GRPO and GEPA by +9.2 and +7.4 points on τ²-bench.
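The abstract does not say how successful and failed trajectories are contrasted, so the following is only a toy illustration of the general idea, not the TRACE implementation. It assumes, hypothetically, that each trajectory already carries a set of capability labels, and it flags a capability as lacking when it appears much more often in successes than in failures. The names `Trajectory`, `lacking_capabilities`, and `min_gap`, and the example labels, are all invented for this sketch.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Trajectory:
    """Hypothetical record: task outcome plus capability labels observed in it."""
    succeeded: bool
    capabilities: frozenset


def lacking_capabilities(trajectories, min_gap=0.5):
    """Return capabilities exercised far more often in successes than in failures.

    The gap is (share of successes exercising the capability) minus
    (share of failures exercising it). This rule is an assumption of the sketch.
    """
    wins = [t for t in trajectories if t.succeeded]
    losses = [t for t in trajectories if not t.succeeded]
    if not wins or not losses:
        return []  # nothing to contrast without both outcomes

    win_counts = Counter(c for t in wins for c in t.capabilities)
    loss_counts = Counter(c for t in losses for c in t.capabilities)

    gaps = {}
    for cap, count in win_counts.items():
        gap = count / len(wins) - loss_counts[cap] / len(losses)
        if gap >= min_gap:
            gaps[cap] = gap
    # Largest gap first; ties broken alphabetically for a stable order.
    return sorted(gaps, key=lambda c: (-gaps[c], c))


example = [
    Trajectory(True, frozenset({"verify_identity", "lookup_order"})),
    Trajectory(True, frozenset({"verify_identity", "lookup_order"})),
    Trajectory(False, frozenset({"lookup_order"})),
    Trajectory(False, frozenset({"lookup_order"})),
]
print(lacking_capabilities(example))  # ['verify_identity']
```

In a pipeline like the one described, each flagged capability would then get its own synthetic training environment and adapter. Those stages are outside the scope of this sketch.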
