ToolLLM: Facilitating Large Language Models to Master 16000+ Real-world APIs
July 31, 2023
Authors: Yujia Qin, Shihao Liang, Yining Ye, Kunlun Zhu, Lan Yan, Yaxi Lu, Yankai Lin, Xin Cong, Xiangru Tang, Bill Qian, Sihan Zhao, Runchu Tian, Ruobing Xie, Jie Zhou, Mark Gerstein, Dahai Li, Zhiyuan Liu, Maosong Sun
cs.AI
Abstract
Despite the advancements of open-source large language models (LLMs) and
their variants, e.g., LLaMA and Vicuna, they remain significantly limited in
performing higher-level tasks, such as following human instructions to use
external tools (APIs). This is because current instruction tuning largely
focuses on basic language tasks instead of the tool-use domain. This is in
contrast to state-of-the-art (SOTA) LLMs, e.g., ChatGPT, which have
demonstrated excellent tool-use capabilities but are unfortunately closed
source. To facilitate tool-use capabilities within open-source LLMs, we
introduce ToolLLM, a general tool-use framework of data construction, model
training and evaluation. We first present ToolBench, an instruction-tuning
dataset for tool use, which is created automatically using ChatGPT.
Specifically, we collect 16,464 real-world RESTful APIs spanning 49 categories
from RapidAPI Hub, then prompt ChatGPT to generate diverse human instructions
involving these APIs, covering both single-tool and multi-tool scenarios.
Finally, we use ChatGPT to search for a valid solution path (chain of API
calls) for each instruction. To make the search process more efficient, we
develop a novel depth-first search-based decision tree (DFSDT), enabling LLMs
to evaluate multiple reasoning traces and expand the search space. We show that
DFSDT significantly enhances the planning and reasoning capabilities of LLMs.
For efficient tool-use assessment, we develop an automatic evaluator: ToolEval.
We fine-tune LLaMA on ToolBench and obtain ToolLLaMA. Our ToolEval reveals that
ToolLLaMA demonstrates a remarkable ability to execute complex instructions and
generalize to unseen APIs, and exhibits comparable performance to ChatGPT. To
make the pipeline more practical, we devise a neural API retriever to recommend
appropriate APIs for each instruction, negating the need for manual API
selection.
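The core idea behind DFSDT, as described above, is to let the model explore several candidate API calls at each step and backtrack when a branch dead-ends, rather than committing to a single chain. A minimal sketch of that search loop follows; all helper names (`propose_actions`, `is_solution`) are illustrative assumptions, since in the actual system candidates come from prompting an LLM and "success" means a working chain of real API calls:

```python
# Minimal sketch of a depth-first-search decision tree (DFSDT) over tool calls.
# Hypothetical helpers: in ToolLLM, propose_actions would come from an LLM and
# is_solution from checking whether the instruction has been fulfilled.

def dfs_solve(state, propose_actions, is_solution, max_depth=10):
    """Depth-first search for a valid chain of API calls.

    propose_actions(state) -> list of (action, next_state) candidates
    is_solution(state)     -> True once the instruction is fulfilled
    Returns the action chain on success, or None on a dead end.
    """
    if is_solution(state):
        return []                      # solved: no further calls needed
    if max_depth == 0:
        return None                    # depth budget exhausted
    for action, next_state in propose_actions(state):
        tail = dfs_solve(next_state, propose_actions, is_solution, max_depth - 1)
        if tail is not None:           # this branch reached a solution
            return [action] + tail     # prepend the call to the chain
    return None                        # every branch failed: backtrack
```

Unlike a single greedy chain of calls, a failed branch here returns `None` and the search backtracks to try the next candidate — this is the expanded search space the abstract refers to.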
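The neural API retriever mentioned at the end can be read as a dense-retrieval step: embed the instruction and each API's documentation, then rank APIs by similarity. A toy sketch under that assumption (the embedding model itself is omitted; embeddings are passed in as plain vectors, and all names are invented for illustration):

```python
# Toy sketch of dense API retrieval: rank API-doc embeddings by cosine
# similarity to the instruction embedding. The actual retriever trains a
# neural encoder; here the vectors are assumed to be given.
from math import sqrt

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))

def retrieve_apis(query_vec, api_vecs, k=3):
    """Return the indices of the top-k APIs most similar to the query."""
    ranked = sorted(range(len(api_vecs)),
                    key=lambda i: -cosine(query_vec, api_vecs[i]))
    return ranked[:k]
```

With 16,464 candidate APIs, such a retriever narrows the prompt to a handful of relevant ones, which is what removes the need for manual API selection.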