ToolLLM: Facilitating Large Language Models to Master 16000+ Real-world APIs
July 31, 2023
Authors: Yujia Qin, Shihao Liang, Yining Ye, Kunlun Zhu, Lan Yan, Yaxi Lu, Yankai Lin, Xin Cong, Xiangru Tang, Bill Qian, Sihan Zhao, Runchu Tian, Ruobing Xie, Jie Zhou, Mark Gerstein, Dahai Li, Zhiyuan Liu, Maosong Sun
cs.AI
Abstract
Despite the advancements of open-source large language models (LLMs) and
their variants, e.g., LLaMA and Vicuna, they remain significantly limited in
performing higher-level tasks, such as following human instructions to use
external tools (APIs). This is because current instruction tuning largely
focuses on basic language tasks instead of the tool-use domain. This is in
contrast to state-of-the-art (SOTA) LLMs, e.g., ChatGPT, which have
demonstrated excellent tool-use capabilities but are unfortunately closed
source. To facilitate tool-use capabilities within open-source LLMs, we
introduce ToolLLM, a general tool-use framework of data construction, model
training and evaluation. We first present ToolBench, an instruction-tuning
dataset for tool use, which is created automatically using ChatGPT.
Specifically, we collect 16,464 real-world RESTful APIs spanning 49 categories
from RapidAPI Hub, then prompt ChatGPT to generate diverse human instructions
involving these APIs, covering both single-tool and multi-tool scenarios.
Finally, we use ChatGPT to search for a valid solution path (chain of API
calls) for each instruction. To make the search process more efficient, we
develop a novel depth-first search-based decision tree (DFSDT), enabling LLMs
to evaluate multiple reasoning traces and expand the search space. We show that
DFSDT significantly enhances the planning and reasoning capabilities of LLMs.
For efficient tool-use assessment, we develop an automatic evaluator: ToolEval.
We fine-tune LLaMA on ToolBench and obtain ToolLLaMA. Our ToolEval reveals that
ToolLLaMA demonstrates a remarkable ability to execute complex instructions and
generalize to unseen APIs, and exhibits comparable performance to ChatGPT. To
make the pipeline more practical, we devise a neural API retriever to recommend
appropriate APIs for each instruction, negating the need for manual API
selection.
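The core idea behind DFSDT, as described above, is to let the model explore several candidate API calls at each step and backtrack when a branch dead-ends, rather than committing to a single chain. A minimal sketch of that search loop follows; all helper names (`propose_actions`, `is_solution`) are illustrative assumptions, since in the actual system candidates come from prompting an LLM and "success" means a working chain of real API calls:

```python
# Minimal sketch of a depth-first-search decision tree (DFSDT) over tool calls.
# Hypothetical helpers: in ToolLLM, propose_actions would come from an LLM and
# is_solution from checking whether the instruction has been fulfilled.

def dfs_solve(state, propose_actions, is_solution, max_depth=10):
    """Depth-first search for a valid chain of API calls.

    propose_actions(state) -> list of (action, next_state) candidates
    is_solution(state)     -> True once the instruction is fulfilled
    Returns the action chain on success, or None on a dead end.
    """
    if is_solution(state):
        return []                      # solved: no further calls needed
    if max_depth == 0:
        return None                    # depth budget exhausted
    for action, next_state in propose_actions(state):
        tail = dfs_solve(next_state, propose_actions, is_solution, max_depth - 1)
        if tail is not None:           # this branch reached a solution
            return [action] + tail     # prepend the call to the chain
    return None                        # every branch failed: backtrack
```

Unlike a single greedy chain of calls, a failed branch here returns `None` and the search backtracks to try the next candidate — this is the expanded search space the abstract refers to.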
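The neural API retriever mentioned at the end can be read as a dense-retrieval step: embed the instruction and each API's documentation, then rank APIs by similarity. A toy sketch under that assumption (the embedding model itself is omitted; embeddings are passed in as plain vectors, and all names are invented for illustration):

```python
# Toy sketch of dense API retrieval: rank API-doc embeddings by cosine
# similarity to the instruction embedding. The actual retriever trains a
# neural encoder; here the vectors are assumed to be given.
from math import sqrt

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))

def retrieve_apis(query_vec, api_vecs, k=3):
    """Return the indices of the top-k APIs most similar to the query."""
    ranked = sorted(range(len(api_vecs)),
                    key=lambda i: -cosine(query_vec, api_vecs[i]))
    return ranked[:k]
```

With 16,464 candidate APIs, such a retriever narrows the prompt to a handful of relevant ones, which is what removes the need for manual API selection.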