MTU-Bench: A Multi-granularity Tool-Use Benchmark for Large Language Models

Abstract

Large Language Models (LLMs) have shown substantial improvements in reasoning and decision-making and can hold natural conversations with users. Recently, many tool-use benchmark datasets have been proposed. However, existing datasets have two limitations: (1) insufficient evaluation scenarios (e.g., they cover only a limited set of tool-use scenes), and (2) high evaluation costs (e.g., GPT API costs). To address these limitations, we propose MTU-Bench, a multi-granularity tool-use benchmark for large language models. By virtue of its multi-granularity design, MTU-Bench covers five tool-usage scenes: single-turn single-tool, single-turn multiple-tool, multiple-turn single-tool, multiple-turn multiple-tool, and out-of-distribution tasks. In addition, all evaluation metrics in MTU-Bench are computed from the prediction results and the ground truth, without relying on GPT-based or human evaluation. MTU-Bench is collected by transforming existing high-quality datasets to simulate real-world tool-usage scenarios, and we also propose an instruction dataset called MTU-Instruct to enhance the tool-use abilities of existing LLMs. Comprehensive experimental results demonstrate the effectiveness of MTU-Bench. Code and data will be released at https://github.com/MTU-Bench-Team/MTU-Bench.git.
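
As a rough illustration of what scoring against a ground truth, without a GPT or human judge, can look like, the sketch below compares predicted tool calls with reference calls by exact match. It is not the metric definition used by MTU-Bench: the dictionary fields `name` and `arguments` and both function names are hypothetical, chosen only for this example.

```python
from typing import Any


def tool_call_matches(pred: dict[str, Any], gold: dict[str, Any]) -> bool:
    """Return True when the predicted call has the same tool name and arguments.

    Exact match is an assumed, simplified criterion for illustration only.
    """
    return (
        pred.get("name") == gold.get("name")
        and pred.get("arguments") == gold.get("arguments")
    )


def exact_match_rate(preds: list[dict[str, Any]], golds: list[dict[str, Any]]) -> float:
    """Fraction of predicted tool calls that exactly match their reference call."""
    if len(preds) != len(golds):
        raise ValueError("preds and golds must have the same length")
    if not golds:
        return 0.0
    hits = sum(tool_call_matches(p, g) for p, g in zip(preds, golds))
    return hits / len(golds)


if __name__ == "__main__":
    # Hypothetical tool calls; the tool names and arguments are made up.
    golds = [
        {"name": "get_weather", "arguments": {"city": "Tokyo"}},
        {"name": "book_flight", "arguments": {"to": "Osaka"}},
    ]
    preds = [
        {"name": "get_weather", "arguments": {"city": "Tokyo"}},
        {"name": "book_flight", "arguments": {"to": "Kyoto"}},
    ]
    print(exact_match_rate(preds, golds))
```

Because the score is a deterministic comparison of structured outputs, it can be recomputed at no API cost, which is the property the abstract emphasizes.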
