Disambiguation-Centric Finetuning Makes Enterprise Tool-Calling LLMs More Realistic and Less Risky

Abstract

Large language models (LLMs) are increasingly tasked with invoking enterprise APIs, yet they routinely falter when near-duplicate tools vie for the same user intent or when required arguments are left underspecified. We introduce DiaFORGE (Dialogue Framework for Organic Response Generation & Evaluation), a disambiguation-centric, three-stage pipeline that (i) synthesizes persona-driven, multi-turn dialogues in which the assistant must distinguish among highly similar tools, (ii) performs supervised fine-tuning, with reasoning traces, of open-source models ranging from 3B to 70B parameters, and (iii) evaluates real-world readiness via a dynamic suite that redeploys each model in a live agentic loop and reports end-to-end goal completion alongside conventional static metrics. On our dynamic benchmark DiaBENCH, models trained with DiaFORGE raise tool-invocation success by 27 pp over GPT-4o and by 49 pp over Claude-3.5-Sonnet, both under optimized prompting. To spur further research, we release an open corpus of 5000 production-grade enterprise API specifications paired with rigorously validated, disambiguation-focused dialogues, offering a practical blueprint for building reliable, enterprise-ready tool-calling agents.
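
The two failure modes named in the abstract (near-duplicate tools competing for one intent, and underspecified required arguments) can be made concrete with a minimal sketch. It is an illustration only, not the DiaFORGE implementation: `ToolSpec`, `next_action`, and both tool names are hypothetical, and the candidate list is assumed to come from some upstream retrieval step that is not shown.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolSpec:
    """Hypothetical, simplified stand-in for an enterprise API specification."""
    name: str
    required: tuple[str, ...]


def next_action(candidates: list[ToolSpec], args: dict[str, str]) -> str:
    """Return the assistant's next move for the current dialogue state."""
    if not candidates:
        return "no matching tool"
    if len(candidates) > 1:
        # Near-duplicate tools compete for the same intent: ask, do not guess.
        names = ", ".join(t.name for t in candidates)
        return f"clarify tool: {names}"
    tool = candidates[0]
    missing = [p for p in tool.required if p not in args]
    if missing:
        # Required arguments are underspecified: request them before calling.
        return f"ask for: {', '.join(missing)}"
    return f"call {tool.name}"


# Two made-up, near-duplicate tools (assumed names, not from the released corpus).
create_po = ToolSpec("create_purchase_order", ("supplier_id", "amount"))
create_req = ToolSpec("create_purchase_requisition", ("requester_id", "amount"))

print(next_action([create_po, create_req], {}))
print(next_action([create_po], {"amount": "500"}))
print(next_action([create_po], {"supplier_id": "S-17", "amount": "500"}))
```

In this toy setting, a tool call is emitted only once a single candidate remains and every required argument is present; until then the assistant asks a clarifying question, which is the kind of multi-turn behavior the synthesized dialogues are described as exercising.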
