CoVe: Training Interactive Tool-Use Agents via Constraint-Guided Verification
March 2, 2026
Authors: Jinpeng Chen, Cheng Gong, Hanbo Li, Ziru Liu, Zichen Tian, Xinyu Fu, Shi Wu, Chenyang Zhang, Wu Zhang, Suiyun Zhang, Dandan Tu, Rui Liu
cs.AI
Abstract
Developing multi-turn interactive tool-use agents is challenging because real-world user needs are often complex and ambiguous, yet agents must execute deterministic actions to satisfy them. To address this gap, we introduce CoVe (Constraint-Verification), a post-training data synthesis framework designed for training interactive tool-use agents while ensuring both data complexity and correctness. CoVe begins by defining explicit task constraints, which serve a dual role: they guide the generation of complex trajectories and act as deterministic verifiers for assessing trajectory quality. This enables the creation of high-quality training trajectories for supervised fine-tuning (SFT) and the derivation of accurate reward signals for reinforcement learning (RL). Our evaluation on the challenging τ²-bench benchmark demonstrates the effectiveness of the framework. Notably, our compact CoVe-4B model achieves success rates of 43.0% and 59.4% in the Airline and Retail domains, respectively; its overall performance significantly outperforms strong baselines of similar scale and remains competitive with models up to 17× its size. These results indicate that CoVe provides an effective and efficient pathway for synthesizing training data for state-of-the-art interactive tool-use agents. To support future research, we open-source our code, trained model, and the full set of 12K high-quality trajectories used for training.
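The dual role of constraints described above — steering trajectory generation and then deterministically scoring the result — can be sketched as follows. This is a minimal illustration, not the paper's implementation: the `Constraint` schema, the tool names, and the binary reward are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical trajectory representation: an ordered list of (tool_name, args) calls.
Trajectory = list[tuple[str, dict]]

@dataclass
class Constraint:
    """A deterministic, checkable condition on a trajectory (hypothetical schema)."""
    name: str
    check: Callable[[Trajectory], bool]

def verify(trajectory: Trajectory, constraints: list[Constraint]) -> float:
    """Return 1.0 only if every constraint holds.

    The same check can filter SFT data (keep trajectories scoring 1.0)
    and serve as a sparse RL reward signal.
    """
    return 1.0 if all(c.check(trajectory) for c in constraints) else 0.0

# Example task: the constraints pin down which tool calls a valid trajectory must contain.
constraints = [
    Constraint("searched_first",
               lambda t: any(name == "search_flights" for name, _ in t)),
    Constraint("booked_economy",
               lambda t: any(name == "book_flight" and args.get("cabin") == "economy"
                             for name, args in t)),
]

good = [("search_flights", {"origin": "SFO"}), ("book_flight", {"cabin": "economy"})]
bad = [("book_flight", {"cabin": "business"})]

print(verify(good, constraints))  # 1.0
print(verify(bad, constraints))   # 0.0
```

Because each check is a pure function of the trajectory, the reward is reproducible and free of judge-model noise, which is what makes it usable both as an SFT filter and as an RL signal.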