
ToolTalk: Evaluating Tool-Usage in a Conversational Setting

November 15, 2023
Authors: Nicholas Farn, Richard Shin
cs.AI

Abstract

Large language models (LLMs) have displayed massive improvements in reasoning and decision-making skills and can hold natural conversations with users. Many recent works seek to augment LLM-based assistants with external tools so they can access private or up-to-date information and carry out actions on behalf of users. To better measure the performance of these assistants, this paper introduces ToolTalk, a benchmark consisting of complex user intents requiring multi-step tool usage specified through dialogue. ToolTalk contains 28 tools grouped into 7 plugins, and includes a complete simulated implementation of each tool, allowing for fully automated evaluation of assistants that rely on execution feedback. ToolTalk also emphasizes tools that externally affect the world rather than only tools for referencing or searching information. We evaluate GPT-3.5 and GPT-4 on ToolTalk resulting in success rates of 26% and 50% respectively. Our analysis of the errors reveals three major categories and suggests some future directions for improvement. We release ToolTalk at https://github.com/microsoft/ToolTalk.
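The abstract does not spell out the benchmark's interface, so the sketch below only illustrates the general idea: tool calls mutate a simulated, stateful plugin (providing execution feedback to the assistant), and a conversation is scored by matching the assistant's predicted calls against ground-truth calls. All names here (`ToolCall`, `SimulatedCalendar`, `calls_match`, `evaluate`) are hypothetical and are not the actual microsoft/ToolTalk API.

```python
# Hypothetical sketch of a ToolTalk-style evaluation loop.
# None of these names come from the actual microsoft/ToolTalk code.
from dataclasses import dataclass


@dataclass
class ToolCall:
    tool: str
    args: dict


class SimulatedCalendar:
    """A toy simulated plugin: tool calls mutate in-memory state,
    so an assistant can be scored using execution feedback."""

    def __init__(self):
        self.events = []

    def create_event(self, title: str, start: str) -> dict:
        event = {"id": len(self.events), "title": title, "start": start}
        self.events.append(event)
        return event  # execution feedback returned to the assistant


def calls_match(predicted: ToolCall, expected: ToolCall) -> bool:
    # A real harness would compare arguments more leniently
    # (e.g. normalizing dates); exact match keeps the sketch simple.
    return predicted.tool == expected.tool and predicted.args == expected.args


def evaluate(predicted: list[ToolCall], expected: list[ToolCall]) -> bool:
    """Succeed only if every ground-truth call is matched by a prediction.
    Action tools affect the world and cannot be silently undone, which
    motivates a strict whole-conversation success criterion."""
    remaining = list(predicted)
    for exp in expected:
        match = next((p for p in remaining if calls_match(p, exp)), None)
        if match is None:
            return False
        remaining.remove(match)
    return True


if __name__ == "__main__":
    calendar = SimulatedCalendar()
    predicted = [ToolCall("create_event", {"title": "standup", "start": "09:00"})]
    calendar.create_event(**predicted[0].args)  # simulate the assistant acting
    expected = [ToolCall("create_event", {"title": "standup", "start": "09:00"})]
    print("success:", evaluate(predicted, expected))  # success: True
```

Under these assumptions, the 26% and 50% success rates reported for GPT-3.5 and GPT-4 would correspond to the fraction of conversations for which a check like `evaluate` passes on every turn.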