CoVe: Training Interactive Tool-Use Agents via Constraint-Guided Verification
March 2, 2026
Authors: Jinpeng Chen, Cheng Gong, Hanbo Li, Ziru Liu, Zichen Tian, Xinyu Fu, Shi Wu, Chenyang Zhang, Wu Zhang, Suiyun Zhang, Dandan Tu, Rui Liu
cs.AI
Abstract
Developing multi-turn interactive tool-use agents is challenging because real-world user needs are often complex and ambiguous, yet agents must execute deterministic actions to satisfy them. To address this gap, we introduce CoVe (Constraint-Verification), a post-training data synthesis framework designed for training interactive tool-use agents while ensuring both data complexity and correctness. CoVe begins by defining explicit task constraints, which serve a dual role: they guide the generation of complex trajectories and act as deterministic verifiers for assessing trajectory quality. This enables the creation of high-quality training trajectories for supervised fine-tuning (SFT) and the derivation of accurate reward signals for reinforcement learning (RL). Our evaluation on the challenging τ²-bench benchmark demonstrates the effectiveness of the framework. Notably, our compact CoVe-4B model achieves success rates of 43.0% and 59.4% in the Airline and Retail domains, respectively; its overall performance significantly outperforms strong baselines of similar scale and remains competitive with models up to 17× its size. These results indicate that CoVe provides an effective and efficient pathway for synthesizing training data for state-of-the-art interactive tool-use agents. To support future research, we open-source our code, trained model, and the full set of 12K high-quality trajectories used for training.
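To make the dual role of constraints concrete, here is a minimal sketch of how explicit task constraints can act as deterministic verifiers over a tool-call trajectory and yield a binary RL reward. All names (`Constraint`, `verify`, the toy tool calls) are illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch: task constraints as deterministic trajectory verifiers.
# A trajectory is a list of (tool_name, args) calls made by the agent.
from dataclasses import dataclass
from typing import Callable, List, Tuple

Trajectory = List[Tuple[str, dict]]

@dataclass
class Constraint:
    name: str
    check: Callable[[Trajectory], bool]  # deterministic predicate

def verify(trajectory: Trajectory, constraints: List[Constraint]) -> float:
    """Return 1.0 only if every constraint holds -- a binary reward signal."""
    return 1.0 if all(c.check(trajectory) for c in constraints) else 0.0

# Toy airline-domain trajectory and two example constraints (assumed, for illustration).
traj = [("search_flight", {"date": "2026-03-02"}),
        ("book_flight", {"flight_id": "F123"})]

constraints = [
    Constraint("searched_before_booking",
               lambda t: [name for name, _ in t].index("search_flight")
                         < [name for name, _ in t].index("book_flight")),
    Constraint("booked_exactly_once",
               lambda t: sum(name == "book_flight" for name, _ in t) == 1),
]

reward = verify(traj, constraints)  # 1.0: both constraints satisfied
```

Because each `check` is a deterministic function of the trajectory, the same constraint set can filter synthesized SFT trajectories (keep only reward-1.0 ones) and score rollouts during RL, without relying on a learned judge.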