

ToolTalk: Evaluating Tool-Usage in a Conversational Setting

November 15, 2023
作者: Nicholas Farn, Richard Shin
cs.AI

Abstract

Large language models (LLMs) have displayed massive improvements in reasoning and decision-making skills and can hold natural conversations with users. Many recent works seek to augment LLM-based assistants with external tools so they can access private or up-to-date information and carry out actions on behalf of users. To better measure the performance of these assistants, this paper introduces ToolTalk, a benchmark consisting of complex user intents requiring multi-step tool usage specified through dialogue. ToolTalk contains 28 tools grouped into 7 plugins, and includes a complete simulated implementation of each tool, allowing for fully automated evaluation of assistants that rely on execution feedback. ToolTalk also emphasizes tools that externally affect the world rather than only tools for referencing or searching information. We evaluate GPT-3.5 and GPT-4 on ToolTalk resulting in success rates of 26% and 50% respectively. Our analysis of the errors reveals three major categories and suggests some future directions for improvement. We release ToolTalk at https://github.com/microsoft/ToolTalk.
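The core idea in the abstract, simulated tool implementations that allow an assistant's tool calls to be executed and checked automatically, can be illustrated with a minimal sketch. This is not ToolTalk's actual API: the names SimulatedTool, invoke, and evaluate_turn are hypothetical, and the success check is a toy stand-in for the paper's evaluation.

# Illustrative sketch only; names and the success metric are assumptions,
# not the ToolTalk library's real interface.
from dataclasses import dataclass, field


@dataclass
class SimulatedTool:
    """A tool with a deterministic, in-memory implementation so an assistant's
    calls can be executed and inspected without touching real services."""
    name: str
    calls: list = field(default_factory=list)

    def invoke(self, **kwargs):
        # Record the action; an "action" tool like this changes simulated world state.
        self.calls.append(kwargs)
        return {"status": "ok", "id": len(self.calls)}


def evaluate_turn(predicted_calls, ground_truth_calls):
    """Toy success check: every ground-truth call must appear among the
    assistant's predicted calls with matching arguments."""
    return all(call in predicted_calls for call in ground_truth_calls)


# Usage example with a hypothetical email tool.
email = SimulatedTool("send_email")
predicted = [{"to": "alice@example.com", "subject": "Meeting", "body": "3pm works."}]
for call in predicted:
    email.invoke(**call)
print(evaluate_turn(predicted, predicted))  # True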