ToolLLM: Facilitating Large Language Models to Master 16000+ Real-world APIs

Abstract

Despite the advancements of open-source large language models (LLMs) and their variants, e.g., LLaMA and Vicuna, they remain significantly limited in performing higher-level tasks, such as following human instructions to use external tools (APIs). This is because current instruction tuning largely focuses on basic language tasks rather than the tool-use domain. This contrasts with state-of-the-art (SOTA) LLMs, e.g., ChatGPT, which have demonstrated excellent tool-use capabilities but are unfortunately closed source. To facilitate tool-use capabilities within open-source LLMs, we introduce ToolLLM, a general tool-use framework covering data construction, model training, and evaluation. We first present ToolBench, an instruction-tuning dataset for tool use, which is created automatically using ChatGPT. Specifically, we collect 16,464 real-world RESTful APIs spanning 49 categories from RapidAPI Hub, then prompt ChatGPT to generate diverse human instructions involving these APIs, covering both single-tool and multi-tool scenarios. Finally, we use ChatGPT to search for a valid solution path (a chain of API calls) for each instruction. To make the search process more efficient, we develop a novel depth-first search-based decision tree (DFSDT), enabling LLMs to evaluate multiple reasoning traces and expand the search space. We show that DFSDT significantly enhances the planning and reasoning capabilities of LLMs. For efficient tool-use assessment, we develop an automatic evaluator, ToolEval. We fine-tune LLaMA on ToolBench and obtain ToolLLaMA. ToolEval reveals that ToolLLaMA demonstrates a remarkable ability to execute complex instructions and generalize to unseen APIs, and that it exhibits performance comparable to ChatGPT. To make the pipeline more practical, we devise a neural API retriever that recommends appropriate APIs for each instruction, removing the need for manual API selection.
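
The abstract does not spell out how DFSDT works internally, so the following is only an illustrative sketch of the general idea it names: a depth-first search with backtracking over candidate chains of API calls. Everything in it is an assumption made for illustration, not the authors' implementation: the toy APIs (`get_city`, `get_weather`, `broken_api`), the `propose_calls` function standing in for the LLM's choice of next call, and the `max_depth` limit are all hypothetical.

```python
from typing import Callable, Dict, List, Optional, Tuple

# A "state" is the set of facts gathered so far; a toy API maps a state to a
# new state, or returns None to signal a failed call. All names here are
# hypothetical stand-ins, not real RapidAPI endpoints.
State = Dict[str, str]
Api = Callable[[State], Optional[State]]


def get_city(state: State) -> Optional[State]:
    return {**state, "city": "Paris"}


def get_weather(state: State) -> Optional[State]:
    # Fails unless a city has already been resolved.
    if "city" not in state:
        return None
    return {**state, "weather": "sunny"}


def broken_api(state: State) -> Optional[State]:
    return None  # Always fails, which forces the search to backtrack.


APIS: Dict[str, Api] = {
    "broken_api": broken_api,
    "get_weather": get_weather,
    "get_city": get_city,
}


def propose_calls(state: State, path: Tuple[str, ...]) -> List[str]:
    """Stand-in for the LLM: list candidate API names for the next step."""
    return [name for name in APIS if name not in path]


def dfs_solve(
    state: State,
    goal_key: str,
    path: Tuple[str, ...] = (),
    max_depth: int = 4,
) -> Optional[List[str]]:
    """Return a chain of API names that produces goal_key, or None."""
    if goal_key in state:
        return list(path)
    if len(path) >= max_depth:
        return None
    for name in propose_calls(state, path):
        new_state = APIS[name](state)
        if new_state is None:
            continue  # Failed call: abandon this branch, try the next sibling.
        result = dfs_solve(new_state, goal_key, path + (name,), max_depth)
        if result is not None:
            return result
    return None  # Every child failed: the caller backtracks.


print(dfs_solve({}, "weather"))
```

In this toy setup the search first tries calls that fail, backs out of those branches, and only then finds a chain that reaches the goal. A single left-to-right attempt with no backtracking would have stopped at the first failed call.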
