CoVe: Training Interactive Tool-Use Agents via Constraint-Guided Verification

Abstract

Developing multi-turn interactive tool-use agents is challenging because real-world user needs are often complex and ambiguous, yet agents must execute deterministic actions to satisfy them. To address this gap, we introduce CoVe (Constraint-Verification), a post-training data synthesis framework for training interactive tool-use agents that is designed to ensure both data complexity and correctness. CoVe begins by defining explicit task constraints, which serve a dual role: they guide the generation of complex trajectories, and they act as deterministic verifiers for assessing trajectory quality. This enables the creation of high-quality training trajectories for supervised fine-tuning (SFT) and the derivation of accurate reward signals for reinforcement learning (RL). Our evaluation on the challenging τ²-bench benchmark demonstrates the effectiveness of the framework. Notably, our compact CoVe-4B model achieves success rates of 43.0% and 59.4% in the Airline and Retail domains, respectively. Its overall performance significantly exceeds that of strong baselines of similar scale and remains competitive with models up to 17 times its size. These results indicate that CoVe provides an effective and efficient pathway for synthesizing training data for state-of-the-art interactive tool-use agents. To support future research, we open-source our code, trained model, and the full set of 12K high-quality trajectories used for training.
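To make the dual role of constraints more concrete, the following is a minimal illustrative sketch of how an explicit constraint could act as a deterministic verifier and yield a reward signal. It is not the authors' implementation. The `Constraint` class, the `verify` and `reward` functions, the dictionary-based final state, the retail-style field names, and the binary reward are all assumptions introduced here for illustration.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass(frozen=True)
class Constraint:
    """Hypothetical task constraint: a name plus a deterministic check on the final state."""
    name: str
    check: Callable[[dict[str, Any]], bool]


def verify(final_state: dict[str, Any], constraints: list[Constraint]) -> list[str]:
    """Return the names of all constraints that the final state violates."""
    return [c.name for c in constraints if not c.check(final_state)]


def reward(final_state: dict[str, Any], constraints: list[Constraint]) -> float:
    """Assumed binary reward: 1.0 only if every constraint is satisfied, else 0.0."""
    return 1.0 if not verify(final_state, constraints) else 0.0


# Hypothetical constraints for a retail-style task (field names are made up).
constraints = [
    Constraint("order_cancelled", lambda s: s.get("order_status") == "cancelled"),
    Constraint("refund_to_original_payment",
               lambda s: s.get("refund_method") == "original_payment"),
]

# Two hypothetical end states reached by different trajectories.
good_state = {"order_status": "cancelled", "refund_method": "original_payment"}
bad_state = {"order_status": "cancelled", "refund_method": "gift_card"}

good_reward = reward(good_state, constraints)
bad_reward = reward(bad_state, constraints)
bad_violations = verify(bad_state, constraints)
```

In a sketch like this, the same constraint list could in principle be passed to a task generator to shape the user scenario and then reused to filter trajectories for SFT or score rollouts for RL. How CoVe actually represents and checks constraints is not specified in the abstract.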
