CoVe: Training Interactive Tool-Use Agents via Constraint-Guided Verification

Abstract

Developing multi-turn interactive tool-use agents is challenging because real-world user needs are often complex and ambiguous, yet agents must execute deterministic actions to satisfy them. To address this gap, we introduce CoVe (Constraint-Verification), a post-training data synthesis framework for training interactive tool-use agents that ensures both the complexity and the correctness of the data. CoVe begins by defining explicit task constraints, which serve a dual role: they guide the generation of complex trajectories, and they act as deterministic verifiers for assessing trajectory quality. This enables the creation of high-quality training trajectories for supervised fine-tuning (SFT) and the derivation of accurate reward signals for reinforcement learning (RL). Our evaluation on the challenging τ²-bench benchmark demonstrates the effectiveness of the framework. Notably, our compact CoVe-4B model achieves success rates of 43.0% and 59.4% in the Airline and Retail domains, respectively; its overall performance significantly exceeds that of strong baselines of similar scale and remains competitive with models up to 17 times its size. These results indicate that CoVe provides an effective and efficient pathway for synthesizing training data for state-of-the-art interactive tool-use agents. To support future research, we open-source our code, our trained model, and the full set of 12K high-quality trajectories used for training.
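The abstract does not say how constraints are represented or checked, so the following is only an illustrative sketch of the dual-role idea: the same explicit constraints that specify a task can be evaluated deterministically against a finished trajectory, both to filter SFT data and to produce a binary RL reward. Every name here (`Constraint`, `called_with`, `never_called`, `verify`, `reward`, the tool names, and the trajectory layout) is a hypothetical assumption, not the CoVe implementation.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List

# Assumption: a trajectory is reduced to the ordered list of tool calls
# the agent made, each shaped like {"tool": str, "args": dict}.
ToolCall = Dict[str, Any]
Check = Callable[[List[ToolCall]], bool]


@dataclass(frozen=True)
class Constraint:
    """A named, deterministic predicate over a trajectory (hypothetical)."""
    name: str
    check: Check


def called_with(tool: str, **expected: Any) -> Check:
    """True if some call to `tool` carries all of the expected argument values."""
    def check(trajectory: List[ToolCall]) -> bool:
        return any(
            call["tool"] == tool
            and all(call["args"].get(k) == v for k, v in expected.items())
            for call in trajectory
        )
    return check


def never_called(tool: str) -> Check:
    """True if `tool` does not appear anywhere in the trajectory."""
    return lambda trajectory: all(call["tool"] != tool for call in trajectory)


def verify(trajectory: List[ToolCall], constraints: List[Constraint]) -> bool:
    """A trajectory passes only if every constraint holds."""
    return all(c.check(trajectory) for c in constraints)


def reward(trajectory: List[ToolCall], constraints: List[Constraint]) -> float:
    """Binary reward derived from the same verifier (an assumed design)."""
    return 1.0 if verify(trajectory, constraints) else 0.0


# Made-up retail-style task: cancel order W123 and do not issue a refund.
constraints = [
    Constraint("cancel_target_order", called_with("cancel_order", order_id="W123")),
    Constraint("no_refund", never_called("issue_refund")),
]

good = [
    {"tool": "get_order", "args": {"order_id": "W123"}},
    {"tool": "cancel_order", "args": {"order_id": "W123", "reason": "no longer needed"}},
]
bad = good + [{"tool": "issue_refund", "args": {"order_id": "W123"}}]

sft_keep = [t for t in (good, bad) if verify(t, constraints)]  # keeps only `good`
```

In this sketch, `verify` acts as the data filter for SFT and `reward` reuses it as the RL signal, so both stages rest on the same deterministic check and no learned judge is needed.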

