ToolSandbox: A Stateful, Conversational, Interactive Evaluation Benchmark for LLM Tool Use Capabilities
August 8, 2024
Authors: Jiarui Lu, Thomas Holleis, Yizhe Zhang, Bernhard Aumayer, Feng Nan, Felix Bai, Shuang Ma, Shen Ma, Mengyu Li, Guoli Yin, Zirui Wang, Ruoming Pang
cs.AI
Abstract
Recent advancements in large language models (LLMs) have sparked growing research interest in tool-assisted LLMs that solve real-world challenges, which calls for comprehensive evaluation of tool-use capabilities. Whereas previous works focused on evaluating over stateless web services (RESTful APIs), a single-turn user prompt, or an off-policy dialog trajectory, ToolSandbox includes stateful tool execution, implicit state dependencies between tools, a built-in user simulator supporting on-policy conversational evaluation, and a dynamic evaluation strategy for intermediate and final milestones over an arbitrary trajectory. We show that open-source and proprietary models have a significant performance gap, and that complex tasks like State Dependency, Canonicalization, and Insufficient Information defined in ToolSandbox challenge even the most capable SOTA LLMs, providing brand-new insights into tool-use LLM capabilities. The ToolSandbox evaluation framework is released at https://github.com/apple/ToolSandbox.
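
To make the two central ideas of the abstract concrete, here is a minimal Python sketch of stateful tools with an implicit state dependency, plus a milestone check that inspects the resulting world state rather than the exact tool-call sequence. This is an illustration only, not the actual ToolSandbox API: `WorldState`, `set_cellular_service`, `send_message`, and `milestone_reached` are hypothetical names chosen for this example.

```python
# Hypothetical sketch of stateful tool use (not the real ToolSandbox API).
# Two "tools" share a mutable world state; send_message has an implicit
# dependency on cellular service being enabled.

from dataclasses import dataclass, field

@dataclass
class WorldState:
    cellular_enabled: bool = False
    sent_messages: list = field(default_factory=list)

def set_cellular_service(state: WorldState, enabled: bool) -> str:
    """Tool: toggle cellular service in the shared world state."""
    state.cellular_enabled = enabled
    return f"cellular service {'on' if enabled else 'off'}"

def send_message(state: WorldState, recipient: str, text: str) -> str:
    """Tool: fails unless the implicit dependency (cellular on) holds."""
    if not state.cellular_enabled:
        raise RuntimeError("No cellular service; enable it first.")
    state.sent_messages.append((recipient, text))
    return "message sent"

def milestone_reached(state: WorldState, recipient: str) -> bool:
    """Milestone check: verifies the final world state, so any trajectory
    that reaches this state passes, regardless of the exact call order."""
    return any(r == recipient for r, _ in state.sent_messages)

state = WorldState()
set_cellular_service(state, True)   # the agent must discover this prerequisite
send_message(state, "Alice", "Running late!")
assert milestone_reached(state, "Alice")
```

Checking milestones against world state rather than a fixed gold sequence is what allows evaluation "over an arbitrary trajectory": an agent that satisfies the prerequisite in a different but valid order still passes.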