
Disambiguation-Centric Finetuning Makes Enterprise Tool-Calling LLMs More Realistic and Less Risky

July 4, 2025
作者: Ashutosh Hathidara, Julien Yu, Sebastian Schreiber
cs.AI

Abstract

Large language models (LLMs) are increasingly tasked with invoking enterprise APIs, yet they routinely falter when near-duplicate tools vie for the same user intent or when required arguments are left underspecified. We introduce DiaFORGE (Dialogue Framework for Organic Response Generation & Evaluation), a disambiguation-centric, three-stage pipeline that (i) synthesizes persona-driven, multi-turn dialogues in which the assistant must distinguish among highly similar tools, (ii) performs supervised fine-tuning of open-source models with reasoning traces across 3B-70B parameters, and (iii) evaluates real-world readiness via a dynamic suite that redeploys each model in a live agentic loop and reports end-to-end goal completion alongside conventional static metrics. On our dynamic benchmark DiaBENCH, models trained with DiaFORGE raise tool-invocation success by 27 pp over GPT-4o and by 49 pp over Claude-3.5-Sonnet, both under optimized prompting. To spur further research, we release an open corpus of 5000 production-grade enterprise API specifications paired with rigorously validated, disambiguation-focused dialogues, offering a practical blueprint for building reliable, enterprise-ready tool-calling agents.
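The evaluation idea in the abstract can be illustrated with a minimal sketch: redeploy a policy in a live multi-turn loop over near-duplicate tools, let it either call a tool or ask a clarifying question when required arguments are underspecified, and score end-to-end goal completion. All names here (`ToolSpec`, `toy_policy`, `run_episode`, the example tools) are illustrative assumptions, not DiaFORGE's actual API.

```python
# Hedged sketch of a dynamic tool-calling evaluation loop, not the paper's code.
from dataclasses import dataclass


@dataclass
class ToolSpec:
    name: str
    description: str
    required_args: list


# Two near-duplicate enterprise tools competing for the same user intent.
TOOLS = [
    ToolSpec("create_purchase_order", "Create a purchase order for goods",
             ["supplier_id", "amount"]),
    ToolSpec("create_purchase_requisition", "Request approval to buy goods",
             ["requester_id", "amount"]),
]


def toy_policy(user_turns, tools):
    """Keyword stand-in for an LLM: disambiguate the tool, then either ask a
    clarifying question for a missing required argument or emit the call."""
    text = " ".join(user_turns).lower()
    tool = tools[1] if "approval" in text else tools[0]
    missing = [a for a in tool.required_args if a not in text]
    if missing:
        return {"type": "clarify", "ask": missing[0]}
    return {"type": "call", "tool": tool.name}


def run_episode(policy, tools, user_turns, gold_tool, max_turns=4):
    """Agentic loop: feed user turns one at a time; goal completion means the
    final action is a call to the gold tool (static metrics would instead
    score a single canned response)."""
    history = []
    for turn in user_turns[:max_turns]:
        history.append(turn)
        action = policy(history, tools)
        if action["type"] == "call":
            return action["tool"] == gold_tool
    return False  # ran out of turns without completing the goal


episode = [
    "I need approval to buy new laptops, amount 12000",  # requester_id missing
    "requester_id is U-88",                              # answers the clarification
]
print(run_episode(toy_policy, TOOLS, episode, "create_purchase_requisition"))
```

A static metric would only check the first response against a reference string; the loop above instead credits the model for asking about the missing `requester_id` and completing the call on the next turn, which is the distinction the dynamic suite is designed to capture.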