ToolSandbox: A Stateful, Conversational, Interactive Evaluation Benchmark for LLM Tool Use Capabilities
August 8, 2024
Authors: Jiarui Lu, Thomas Holleis, Yizhe Zhang, Bernhard Aumayer, Feng Nan, Felix Bai, Shuang Ma, Shen Ma, Mengyu Li, Guoli Yin, Zirui Wang, Ruoming Pang
cs.AI
Abstract
Recent advancements in large language models (LLMs) have sparked growing research interest in tool-assisted LLMs that solve real-world challenges, which calls for comprehensive evaluation of tool-use capabilities. Whereas previous works focused on evaluating over stateless web services (RESTful APIs), a single-turn user prompt, or an off-policy dialog trajectory, ToolSandbox includes stateful tool execution, implicit state dependencies between tools, a built-in user simulator supporting on-policy conversational evaluation, and a dynamic evaluation strategy for intermediate and final milestones over an arbitrary trajectory. We show that open-source and proprietary models have a significant performance gap, and that complex tasks like State Dependency, Canonicalization, and Insufficient Information defined in ToolSandbox challenge even the most capable SOTA LLMs, providing brand-new insights into tool-use LLM capabilities. The ToolSandbox evaluation framework is released at https://github.com/apple/ToolSandbox.
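
To make the two central ideas of the abstract concrete, here is a minimal Python sketch of stateful tools with an implicit state dependency, plus a milestone check that inspects the resulting world state rather than the exact tool-call sequence. This is an illustration only, not the actual ToolSandbox API: `WorldState`, `set_cellular_service`, `send_message`, and `milestone_reached` are hypothetical names chosen for this example.

```python
# Hypothetical sketch of stateful tool use (not the real ToolSandbox API).
# Two "tools" share a mutable world state; send_message has an implicit
# dependency on cellular service being enabled.

from dataclasses import dataclass, field

@dataclass
class WorldState:
    cellular_enabled: bool = False
    sent_messages: list = field(default_factory=list)

def set_cellular_service(state: WorldState, enabled: bool) -> str:
    """Tool: toggle cellular service in the shared world state."""
    state.cellular_enabled = enabled
    return f"cellular service {'on' if enabled else 'off'}"

def send_message(state: WorldState, recipient: str, text: str) -> str:
    """Tool: fails unless the implicit dependency (cellular on) holds."""
    if not state.cellular_enabled:
        raise RuntimeError("No cellular service; enable it first.")
    state.sent_messages.append((recipient, text))
    return "message sent"

def milestone_reached(state: WorldState, recipient: str) -> bool:
    """Milestone check: verifies the final world state, so any trajectory
    that reaches this state passes, regardless of the exact call order."""
    return any(r == recipient for r, _ in state.sent_messages)

state = WorldState()
set_cellular_service(state, True)   # the agent must discover this prerequisite
send_message(state, "Alice", "Running late!")
assert milestone_reached(state, "Alice")
```

Checking milestones against world state rather than a fixed gold sequence is what allows evaluation "over an arbitrary trajectory": an agent that satisfies the prerequisite in a different but valid order still passes.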