

MTU-Bench: A Multi-granularity Tool-Use Benchmark for Large Language Models

October 15, 2024
作者: Pei Wang, Yanan Wu, Zekun Wang, Jiaheng Liu, Xiaoshuai Song, Zhongyuan Peng, Ken Deng, Chenchen Zhang, Jiakai Wang, Junran Peng, Ge Zhang, Hangyu Guo, Zhaoxiang Zhang, Wenbo Su, Bo Zheng
cs.AI

Abstract

Large Language Models (LLMs) have displayed massive improvements in reasoning and decision-making skills and can hold natural conversations with users. Recently, many tool-use benchmark datasets have been proposed. However, existing datasets have the following limitations: (1) insufficient evaluation scenarios (e.g., covering only limited tool-use scenes), and (2) extensive evaluation costs (e.g., GPT API costs). To address these limitations, in this work we propose a multi-granularity tool-use benchmark for large language models called MTU-Bench. For the "multi-granularity" property, our MTU-Bench covers five tool-usage scenes (i.e., single-turn single-tool, single-turn multiple-tool, multiple-turn single-tool, multiple-turn multiple-tool, and out-of-distribution tasks). Besides, all evaluation metrics of our MTU-Bench are based on the prediction results and the ground truth, without using any GPT-based or human evaluation. Moreover, our MTU-Bench is collected by transforming existing high-quality datasets to simulate real-world tool-usage scenarios, and we also propose an instruction dataset called MTU-Instruct to enhance the tool-use abilities of existing LLMs. Comprehensive experimental results demonstrate the effectiveness of our MTU-Bench. Code and data will be released at https://github.com/MTU-Bench-Team/MTU-Bench.git.
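The abstract notes that all metrics are computed directly from predictions and ground truth, with no GPT or human judging. A minimal sketch of what such ground-truth-based metrics could look like is given below; the function names, record schema (`"tool"`, `"args"` fields), and metric definitions are illustrative assumptions, not MTU-Bench's actual evaluation code.

```python
# Hypothetical ground-truth-based tool-use metrics: tool-selection
# accuracy and parameter exact match. Schema and names are assumptions
# for illustration, not MTU-Bench's real implementation.

def tool_selection_accuracy(predictions, references):
    """Fraction of turns where the predicted tool name matches the gold tool."""
    correct = sum(p["tool"] == r["tool"] for p, r in zip(predictions, references))
    return correct / len(references)

def parameter_exact_match(predictions, references):
    """Fraction of turns where both the tool name and the full argument
    dictionary match the gold annotation exactly."""
    correct = sum(
        p["tool"] == r["tool"] and p["args"] == r["args"]
        for p, r in zip(predictions, references)
    )
    return correct / len(references)

preds = [
    {"tool": "get_weather", "args": {"city": "Paris"}},
    {"tool": "get_weather", "args": {"city": "Rome", "unit": "F"}},
]
gold = [
    {"tool": "get_weather", "args": {"city": "Paris"}},
    {"tool": "get_weather", "args": {"city": "Rome", "unit": "C"}},
]
print(tool_selection_accuracy(preds, gold))  # 1.0
print(parameter_exact_match(preds, gold))    # 0.5
```

Because such metrics reduce to deterministic comparisons against gold annotations, they cost nothing per evaluation run, in contrast to GPT-judged benchmarks.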
