
ToolSandbox: A Stateful, Conversational, Interactive Evaluation Benchmark for LLM Tool Use Capabilities

August 8, 2024
作者: Jiarui Lu, Thomas Holleis, Yizhe Zhang, Bernhard Aumayer, Feng Nan, Felix Bai, Shuang Ma, Shen Ma, Mengyu Li, Guoli Yin, Zirui Wang, Ruoming Pang
cs.AI

Abstract

Recent advancements in large language models (LLMs) have sparked growing research interest in tool-assisted LLMs that solve real-world challenges, which calls for comprehensive evaluation of tool-use capabilities. While previous works focused on evaluating over stateless web services (RESTful APIs), on single-turn user prompts, or on off-policy dialog trajectories, ToolSandbox includes stateful tool execution, implicit state dependencies between tools, a built-in user simulator supporting on-policy conversational evaluation, and a dynamic evaluation strategy for intermediate and final milestones over an arbitrary trajectory. We show that open-source and proprietary models have a significant performance gap, and that complex tasks such as State Dependency, Canonicalization, and Insufficient Information defined in ToolSandbox challenge even the most capable SOTA LLMs, providing brand-new insights into tool-use LLM capabilities. The ToolSandbox evaluation framework is released at https://github.com/apple/ToolSandbox
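To make the abstract's key ideas concrete, here is a minimal, hypothetical sketch (not the actual ToolSandbox API; all names are illustrative) of stateful tool execution with an implicit state dependency between tools, and a milestone check evaluated against the final world state rather than a fixed trajectory:

```python
class WorldState:
    """Shared mutable state that persists across tool calls."""
    def __init__(self):
        self.wifi_on = False
        self.messages = []

def toggle_wifi(state: WorldState, on: bool) -> str:
    """A tool whose effect is a state change, not just a return value."""
    state.wifi_on = on
    return f"wifi set to {on}"

def send_message(state: WorldState, text: str) -> str:
    """Implicit state dependency: silently requires wifi, so the model
    must discover it needs to call toggle_wifi first."""
    if not state.wifi_on:
        raise RuntimeError("no network connection")
    state.messages.append(text)
    return "message sent"

def milestone_reached(state: WorldState, expected: str) -> bool:
    """Dynamic evaluation: check a milestone against the resulting state,
    whatever trajectory of tool calls the model took to get there."""
    return expected in state.messages

# One possible successful trajectory:
state = WorldState()
toggle_wifi(state, True)      # satisfy the implicit dependency first
send_message(state, "hello")
assert milestone_reached(state, "hello")
```

Calling `send_message` before `toggle_wifi` fails, which is exactly the kind of State Dependency task the abstract reports as challenging for current LLMs.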
