

MTU-Bench: A Multi-granularity Tool-Use Benchmark for Large Language Models

October 15, 2024
作者: Pei Wang, Yanan Wu, Zekun Wang, Jiaheng Liu, Xiaoshuai Song, Zhongyuan Peng, Ken Deng, Chenchen Zhang, Jiakai Wang, Junran Peng, Ge Zhang, Hangyu Guo, Zhaoxiang Zhang, Wenbo Su, Bo Zheng
cs.AI

Abstract

Large Language Models (LLMs) have displayed massive improvements in reasoning and decision-making skills and can hold natural conversations with users. Recently, many tool-use benchmark datasets have been proposed. However, existing datasets have the following limitations: (1) insufficient evaluation scenarios (e.g., they cover only limited tool-use scenes); (2) extensive evaluation costs (e.g., GPT API costs). To address these limitations, in this work, we propose a multi-granularity tool-use benchmark for large language models called MTU-Bench. For the "multi-granularity" property, our MTU-Bench covers five tool-usage scenes (i.e., single-turn single-tool, single-turn multiple-tool, multiple-turn single-tool, multiple-turn multiple-tool, and out-of-distribution tasks). Besides, all evaluation metrics of our MTU-Bench are based on the prediction results and the ground truth, without using any GPT or human evaluation metrics. Moreover, our MTU-Bench is collected by transforming existing high-quality datasets to simulate real-world tool-usage scenarios, and we also propose an instruction dataset called MTU-Instruct to enhance the tool-use abilities of existing LLMs. Comprehensive experimental results demonstrate the effectiveness of our MTU-Bench. Code and data will be released at https://github.com/MTU-Bench-Team/MTU-Bench.git.
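The abstract's claim that all metrics derive from comparing predictions against ground truth, with no GPT or human judging, can be illustrated with a minimal sketch. The function name, data format, and exact-match criterion below are illustrative assumptions, not taken from the MTU-Bench codebase:

```python
# Hypothetical sketch of ground-truth-based tool-use scoring: each turn's
# predicted (tool, arguments) pair is compared to a reference by exact match,
# so evaluation needs no GPT API calls or human annotators.

def tool_call_accuracy(predictions, references):
    """Fraction of turns whose predicted tool name and arguments
    exactly match the ground-truth tool call."""
    assert len(predictions) == len(references)
    correct = sum(
        pred["tool"] == ref["tool"] and pred["args"] == ref["args"]
        for pred, ref in zip(predictions, references)
    )
    return correct / len(references)

# Toy example: one correct call, one with mismatched arguments.
preds = [{"tool": "get_weather", "args": {"city": "Paris"}},
         {"tool": "search", "args": {"query": "LLM benchmarks"}}]
refs = [{"tool": "get_weather", "args": {"city": "Paris"}},
        {"tool": "search", "args": {"query": "tool-use benchmarks"}}]
print(tool_call_accuracy(preds, refs))  # 0.5
```

Because the comparison is deterministic, such metrics are cheap and reproducible across runs, which is the cost advantage the abstract contrasts with GPT-based evaluation.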

