ToolTalk: Evaluating Tool-Usage in a Conversational Setting

Abstract

Large language models (LLMs) have displayed massive improvements in reasoning and decision-making skills and can hold natural conversations with users. Many recent works seek to augment LLM-based assistants with external tools so they can access private or up-to-date information and carry out actions on behalf of users. To better measure the performance of these assistants, this paper introduces ToolTalk, a benchmark consisting of complex user intents requiring multi-step tool usage specified through dialogue. ToolTalk contains 28 tools grouped into 7 plugins and includes a complete simulated implementation of each tool, allowing for fully automated evaluation of assistants that rely on execution feedback. ToolTalk also emphasizes tools that externally affect the world rather than only tools for referencing or searching information. We evaluate GPT-3.5 and GPT-4 on ToolTalk, obtaining success rates of 26% and 50%, respectively. Our analysis of the errors reveals three major categories and suggests some future directions for improvement. We release ToolTalk at https://github.com/microsoft/ToolTalk.
