ToolSandbox: A Stateful, Conversational, Interactive Evaluation Benchmark for LLM Tool Use Capabilities

Abstract

Recent advances in large language models (LLMs) have sparked growing research interest in tool-assisted LLMs that solve real-world challenges, which calls for comprehensive evaluation of tool-use capabilities. While previous work focused on evaluating stateless web services (RESTful APIs), based on a single-turn user prompt, or an off-policy dialog trajectory, ToolSandbox includes stateful tool execution, implicit state dependencies between tools, a built-in user simulator supporting on-policy conversational evaluation, and a dynamic evaluation strategy for intermediate and final milestones over an arbitrary trajectory. We show that open-source and proprietary models have a significant performance gap, and that complex tasks such as State Dependency, Canonicalization, and Insufficient Information, as defined in ToolSandbox, challenge even the most capable state-of-the-art (SOTA) LLMs, providing new insights into the tool-use capabilities of LLMs. The ToolSandbox evaluation framework is released at https://github.com/apple/ToolSandbox.

