Disambiguation-Centric Finetuning Makes Enterprise Tool-Calling LLMs More Realistic and Less Risky
July 4, 2025
Authors: Ashutosh Hathidara, Julien Yu, Sebastian Schreiber
cs.AI
Abstract
Large language models (LLMs) are increasingly tasked with invoking enterprise
APIs, yet they routinely falter when near-duplicate tools vie for the same user
intent or when required arguments are left underspecified. We introduce
DiaFORGE (Dialogue Framework for Organic Response Generation & Evaluation), a
disambiguation-centric, three-stage pipeline that (i) synthesizes
persona-driven, multi-turn dialogues in which the assistant must distinguish
among highly similar tools, (ii) performs supervised fine-tuning of open-source
models (3B to 70B parameters) with reasoning traces, and (iii) evaluates
real-world readiness via a dynamic suite that redeploys each model in a live
agentic loop and reports end-to-end goal completion alongside conventional
static metrics. On our dynamic benchmark DiaBENCH, models trained with DiaFORGE
raise tool-invocation success by 27 pp over GPT-4o and by 49 pp over
Claude-3.5-Sonnet, both under optimized prompting. To spur further research, we
release an open corpus of 5000 production-grade enterprise API specifications
paired with rigorously validated, disambiguation-focused dialogues, offering a
practical blueprint for building reliable, enterprise-ready tool-calling
agents.
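
To make the dynamic evaluation setup concrete, below is a minimal, self-contained sketch of what one episode in a DiaBENCH-style live agentic loop might look like. Everything here (the `ToolSpec` and `EpisodeResult` structures, the `call_model` stub, and the scoring logic) is an illustrative assumption, not the paper's actual harness: the model under test sees a user goal plus near-duplicate tools and is scored on end-to-end goal completion, i.e., selecting the right tool and supplying its required arguments.

```python
"""Minimal sketch of a dynamic tool-invocation evaluation episode.
All names and interfaces are illustrative assumptions; the paper's
actual DiaBENCH harness, schemas, and metrics may differ."""

import json
from dataclasses import dataclass


@dataclass
class ToolSpec:
    # Simplified enterprise API spec: near-duplicate tools differ mainly
    # in name/description and in which arguments are required.
    name: str
    description: str
    required_args: list[str]


@dataclass
class EpisodeResult:
    invoked_tool: str | None
    args_complete: bool
    goal_completed: bool


def call_model(dialogue: list[dict], tools: list[ToolSpec]) -> dict:
    """Placeholder for the model under test (hypothetical interface).
    A real harness would query the fine-tuned LLM here; this stub is a
    naive policy that picks the first tool and supplies no arguments."""
    return {"tool": tools[0].name, "args": {}}


def run_episode(user_goal: str, target: ToolSpec,
                distractors: list[ToolSpec]) -> EpisodeResult:
    """One live episode: the model must (a) distinguish the target tool
    from highly similar distractors and (b) provide its required args."""
    tools = [target, *distractors]
    dialogue = [{"role": "user", "content": user_goal}]
    response = call_model(dialogue, tools)
    args_ok = all(a in response.get("args", {}) for a in target.required_args)
    return EpisodeResult(
        invoked_tool=response.get("tool"),
        args_complete=args_ok,
        goal_completed=(response.get("tool") == target.name) and args_ok,
    )


if __name__ == "__main__":
    # Hypothetical near-duplicate enterprise APIs competing for one intent.
    target = ToolSpec("create_purchase_order",
                      "Create a new purchase order",
                      ["supplier_id", "amount"])
    distractors = [
        ToolSpec("create_purchase_requisition",
                 "Create a purchase requisition", ["requester_id"]),
        ToolSpec("update_purchase_order",
                 "Update an existing purchase order", ["order_id"]),
    ]
    result = run_episode("I need to order 500 units from supplier ACME.",
                         target, distractors)
    # End-to-end goal completion is the headline dynamic metric, reported
    # alongside conventional static tool-selection accuracy.
    print(json.dumps(result.__dict__, indent=2))
```

In a full harness, the loop would span multiple turns, allowing the assistant to ask clarifying questions when required arguments are underspecified before committing to an invocation.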