CoVe: Training Interactive Tool-Use Agents via Constraint-Guided Verification

Abstract

Developing multi-turn interactive tool-use agents is challenging because real-world user needs are often complex and ambiguous, yet agents must execute deterministic actions to satisfy them. To address this gap, we introduce CoVe (Constraint-Verification), a post-training data synthesis framework designed for training interactive tool-use agents while ensuring both data complexity and correctness. CoVe begins by defining explicit task constraints, which serve a dual role: they guide the generation of complex trajectories and act as deterministic verifiers for assessing trajectory quality. This enables the creation of high-quality training trajectories for supervised fine-tuning (SFT) and the derivation of accurate reward signals for reinforcement learning (RL). Our evaluation on the challenging τ²-bench benchmark demonstrates the effectiveness of the framework. Notably, our compact CoVe-4B model achieves success rates of 43.0% and 59.4% in the Airline and Retail domains, respectively; its overall performance significantly outperforms strong baselines of similar scale and remains competitive with models up to 17× its size. These results indicate that CoVe provides an effective and efficient pathway for synthesizing training data for state-of-the-art interactive tool-use agents. To support future research, we open-source our code, trained model, and the full set of 12K high-quality trajectories used for training.
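The dual role of constraints as deterministic verifiers can be illustrated with a minimal sketch. This is not the released CoVe implementation: the `ToolCall` and `Constraint` types, the tool names (`get_reservation`, `cancel_reservation`, `issue_refund`), and the all-or-nothing binary reward are assumptions made purely for illustration.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ToolCall:
    """One tool invocation in an agent trajectory (hypothetical schema)."""
    name: str
    args: dict


@dataclass
class Constraint:
    """An explicit task constraint with a deterministic check (hypothetical)."""
    description: str
    check: Callable[[list], bool]


def reward(trajectory: list, constraints: list) -> float:
    """Return 1.0 only if every constraint holds on the trajectory, else 0.0.

    The binary reward is an assumption of this sketch, not a claim about
    how the paper scores trajectories.
    """
    return 1.0 if all(c.check(trajectory) for c in constraints) else 0.0


# Hypothetical airline-style task: cancel a reservation, never issue a refund.
constraints = [
    Constraint(
        "cancels reservation ABC123",
        lambda t: any(
            c.name == "cancel_reservation"
            and c.args.get("reservation_id") == "ABC123"
            for c in t
        ),
    ),
    Constraint(
        "never issues a refund",
        lambda t: all(c.name != "issue_refund" for c in t),
    ),
]

good = [
    ToolCall("get_reservation", {"reservation_id": "ABC123"}),
    ToolCall("cancel_reservation", {"reservation_id": "ABC123"}),
]
bad = good + [ToolCall("issue_refund", {"amount": 100})]

good_reward = reward(good, constraints)
bad_reward = reward(bad, constraints)
```

Under these assumptions, the same check could serve both purposes named in the abstract: filtering synthesized trajectories before SFT (keep only those scoring 1.0) and supplying a reward signal during RL.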
