ToolLLM: Facilitating Large Language Models to Master 16000+ Real-world APIs

Abstract

Despite the advancements of open-source large language models (LLMs) and their variants, e.g., LLaMA and Vicuna, these models remain significantly limited in performing higher-level tasks, such as following human instructions to use external tools (APIs). The reason is that current instruction tuning largely focuses on basic language tasks rather than the tool-use domain. This stands in contrast to state-of-the-art (SOTA) LLMs, e.g., ChatGPT, which have demonstrated excellent tool-use capabilities but are unfortunately closed-source. To facilitate tool-use capabilities within open-source LLMs, we introduce ToolLLM, a general tool-use framework encompassing data construction, model training, and evaluation. We first present ToolBench, an instruction-tuning dataset for tool use that is created automatically using ChatGPT. Specifically, we collect 16,464 real-world RESTful APIs spanning 49 categories from RapidAPI Hub, then prompt ChatGPT to generate diverse human instructions involving these APIs, covering both single-tool and multi-tool scenarios. Finally, we use ChatGPT to search for a valid solution path (chain of API calls) for each instruction. To make the search process more efficient, we develop a novel depth-first search-based decision tree (DFSDT), which enables LLMs to evaluate multiple reasoning traces and expand the search space. We show that DFSDT significantly enhances the planning and reasoning capabilities of LLMs. For efficient tool-use assessment, we develop an automatic evaluator, ToolEval. We fine-tune LLaMA on ToolBench and obtain ToolLLaMA. Evaluation with ToolEval reveals that ToolLLaMA demonstrates a remarkable ability to execute complex instructions and generalize to unseen APIs, and that it exhibits performance comparable to ChatGPT. To make the pipeline more practical, we devise a neural API retriever that recommends appropriate APIs for each instruction, removing the need for manual API selection.
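The abstract does not spell out how DFSDT works internally, so the following is only a minimal sketch of the general idea it names: a depth-first search with backtracking over chains of API calls. Everything here is an assumption made for illustration. The toy "APIs" are plain integer functions, the names `TOY_APIS` and `dfs_solution_path` are hypothetical, and the loop that tries every API in turn stands in for the step where an LLM would propose and assess candidate calls.

```python
from typing import Callable, Dict, List, Optional

# Hypothetical toy "APIs": each maps an integer state to a new state,
# or returns None to signal a failed call. None of these come from ToolBench.
ToyApi = Callable[[int], Optional[int]]

TOY_APIS: Dict[str, ToyApi] = {
    "add_three": lambda x: x + 3,
    "double": lambda x: x * 2,
    "fail_if_odd": lambda x: None if x % 2 else x // 2,
}


def dfs_solution_path(
    state: int,
    goal: int,
    apis: Dict[str, ToyApi],
    max_depth: int,
    path: Optional[List[str]] = None,
) -> Optional[List[str]]:
    """Depth-first search for a chain of API names that turns state into goal."""
    path = [] if path is None else path
    if state == goal:
        return path  # a valid solution path has been found
    if max_depth == 0:
        return None  # depth budget exhausted: backtrack
    for name, call in apis.items():
        result = call(state)
        if result is None:
            continue  # failed call: abandon this branch and try a sibling
        found = dfs_solution_path(result, goal, apis, max_depth - 1, path + [name])
        if found is not None:
            return found
    return None  # no branch below this node reaches the goal


if __name__ == "__main__":
    print(dfs_solution_path(1, 8, TOY_APIS, max_depth=3))
```

The point of the sketch is the backtracking: when a branch fails or runs out of depth, the search returns to the most recent decision point and tries another call instead of giving up on the whole instruction.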
