ToolLLM: Facilitating Large Language Models to Master 16000+ Real-world APIs
July 31, 2023
Authors: Yujia Qin, Shihao Liang, Yining Ye, Kunlun Zhu, Lan Yan, Yaxi Lu, Yankai Lin, Xin Cong, Xiangru Tang, Bill Qian, Sihan Zhao, Runchu Tian, Ruobing Xie, Jie Zhou, Mark Gerstein, Dahai Li, Zhiyuan Liu, Maosong Sun
cs.AI
Abstract
Despite the advancements of open-source large language models (LLMs) and
their variants, e.g., LLaMA and Vicuna, they remain significantly limited in
performing higher-level tasks, such as following human instructions to use
external tools (APIs). This is because current instruction tuning largely
focuses on basic language tasks instead of the tool-use domain. This is in
contrast to state-of-the-art (SOTA) LLMs, e.g., ChatGPT, which have
demonstrated excellent tool-use capabilities but are unfortunately closed
source. To facilitate tool-use capabilities within open-source LLMs, we
introduce ToolLLM, a general tool-use framework encompassing data construction,
model training, and evaluation. We first present ToolBench, an instruction-tuning
dataset for tool use, which is created automatically using ChatGPT.
Specifically, we collect 16,464 real-world RESTful APIs spanning 49 categories
from RapidAPI Hub, then prompt ChatGPT to generate diverse human instructions
involving these APIs, covering both single-tool and multi-tool scenarios.
Finally, we use ChatGPT to search for a valid solution path (chain of API
calls) for each instruction. To make the searching process more efficient, we
develop a novel depth-first search-based decision tree (DFSDT), enabling LLMs
to evaluate multiple reasoning traces and expand the search space. We show that
DFSDT significantly enhances the planning and reasoning capabilities of LLMs.
For efficient tool-use assessment, we develop an automatic evaluator: ToolEval.
We fine-tune LLaMA on ToolBench and obtain ToolLLaMA. Our ToolEval reveals that
ToolLLaMA demonstrates a remarkable ability to execute complex instructions and
generalize to unseen APIs, and exhibits comparable performance to ChatGPT. To
make the pipeline more practical, we devise a neural API retriever to recommend
appropriate APIs for each instruction, obviating the need for manual API
selection.
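As a rough illustration of the DFSDT idea described above, the sketch below runs a depth-first search over candidate action sequences, backtracking when a branch fails instead of committing to a single reasoning chain. The `propose`, `succeeded`, and `failed` callables are hypothetical stand-ins for an LLM proposing API calls and judging outcomes; they are not part of the paper's actual implementation.

```python
# Minimal sketch of a depth-first search-based decision tree (DFSDT).
# `propose` stands in for an LLM suggesting candidate API calls given
# the path so far; `succeeded`/`failed` stand in for checking whether
# a path solves the instruction or has hit a dead end (all hypothetical).

def dfsdt(path, propose, succeeded, failed, max_depth=6):
    """Return the first action sequence accepted by `succeeded`,
    exploring candidates depth-first and backtracking on failure."""
    if succeeded(path):
        return path                      # valid solution path found
    if failed(path) or len(path) >= max_depth:
        return None                      # prune this branch, backtrack
    for action in propose(path):         # multiple reasoning traces
        result = dfsdt(path + (action,), propose, succeeded, failed, max_depth)
        if result is not None:
            return result
    return None


# Toy run: the first candidate ("flaky_api") is a dead end, so the
# search backtracks and tries "geocode" -> "weather" instead.
def propose(path):
    if not path:
        return ("flaky_api", "geocode")
    if path[-1] == "geocode":
        return ("weather",)
    return ()

succeeded = lambda p: p[-2:] == ("geocode", "weather")
failed = lambda p: bool(p) and p[-1] == "flaky_api"

print(dfsdt((), propose, succeeded, failed))  # ('geocode', 'weather')
```

Compared with a single greedy chain of thought-and-action steps, the depth-first expansion lets the model abandon a failing branch and try alternatives, which is the behavior the abstract credits for the improved planning and reasoning.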
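The neural API retriever can likewise be caricatured as dense retrieval: embed the instruction and every API description, then recommend the top-k APIs by similarity. The bag-of-words `embed` below is a toy stand-in for a trained text encoder, and the API names and descriptions are invented for illustration; they do not come from the paper or from RapidAPI Hub.

```python
# Toy sketch of a neural API retriever as dense retrieval: embed the
# instruction and each API description, then recommend the top-k APIs
# by cosine similarity. A real retriever would use a trained encoder;
# the bag-of-words `embed` and the mini-catalog are stand-ins only.
import math
from collections import Counter

def embed(text):
    return Counter(text.lower().split())   # stand-in for a sentence encoder

def cosine(a, b):
    dot = sum(a[t] * b.get(t, 0) for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(instruction, api_docs, k=3):
    q = embed(instruction)
    ranked = sorted(api_docs, key=lambda name: cosine(q, embed(api_docs[name])),
                    reverse=True)
    return ranked[:k]

# Hypothetical mini-catalog standing in for the RapidAPI collection.
api_docs = {
    "weather": "current weather and forecast for a city",
    "geocode": "convert a city name into latitude and longitude",
    "stocks":  "real-time stock price quotes by ticker symbol",
}
print(retrieve("what is the weather forecast in Paris", api_docs, k=1))
```

With 16,000+ candidate APIs, ranking by embedding similarity like this is what removes the need for a user to pick APIs by hand before issuing an instruction.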