ToolSandbox: A Stateful, Conversational, Interactive Evaluation Benchmark for LLM Tool Use Capabilities

Abstract

Recent advances in large language models (LLMs) have sparked growing research interest in tool-assisted LLMs that solve real-world challenges, which calls for comprehensive evaluation of tool-use capabilities. Previous work focused on evaluating over stateless web services (RESTful APIs), based on either a single-turn user prompt or an off-policy dialog trajectory. In contrast, ToolSandbox includes stateful tool execution, implicit state dependencies between tools, a built-in user simulator supporting on-policy conversational evaluation, and a dynamic evaluation strategy for intermediate and final milestones over an arbitrary trajectory. We show that there is a significant performance gap between open-source and proprietary models, and that complex tasks defined in ToolSandbox, such as State Dependency, Canonicalization, and Insufficient Information, are challenging even for the most capable SOTA LLMs, providing new insights into tool-use LLM capabilities. The ToolSandbox evaluation framework is released at https://github.com/apple/ToolSandbox
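To make the notions of stateful tool execution, implicit state dependency, and milestone-based evaluation concrete, the following is a minimal Python sketch. It is not the ToolSandbox implementation or API: every name in it (`WorldState`, `set_cellular`, `send_message`, `run`, `milestones_in_order`) is hypothetical, and milestone matching is simplified to ordered boolean predicates over state snapshots.

```python
import copy
from dataclasses import dataclass, field


@dataclass
class WorldState:
    """Hypothetical world state that persists across tool calls."""
    cellular_on: bool = False
    sent_messages: list = field(default_factory=list)


def set_cellular(state: WorldState, on: bool) -> str:
    """Stateful tool: mutates the shared world state."""
    state.cellular_on = on
    return "cellular on" if on else "cellular off"


def send_message(state: WorldState, to: str, text: str) -> str:
    """Tool with an implicit state dependency: it fails unless cellular is on."""
    if not state.cellular_on:
        raise RuntimeError("cellular service is off")
    state.sent_messages.append((to, text))
    return "sent"


def run(calls):
    """Execute (tool, kwargs) pairs in order and snapshot the state after each call."""
    state = WorldState()
    snapshots = []
    for tool, kwargs in calls:
        try:
            tool(state, **kwargs)
        except RuntimeError:
            pass  # a failed call leaves the state unchanged
        snapshots.append(copy.deepcopy(state))
    return snapshots


def milestones_in_order(snapshots, milestones) -> bool:
    """Return True if every milestone predicate is satisfied, in order, along the trajectory."""
    idx = 0
    for snap in snapshots:
        # One snapshot may satisfy several consecutive milestones.
        while idx < len(milestones) and milestones[idx](snap):
            idx += 1
    return idx == len(milestones)


# Assumed milestones: an intermediate one (service enabled) and a final one (message sent).
MILESTONES = [
    lambda s: s.cellular_on,
    lambda s: len(s.sent_messages) == 1,
]

good = run([(set_cellular, {"on": True}),
            (send_message, {"to": "Alice", "text": "hi"})])
bad = run([(send_message, {"to": "Alice", "text": "hi"})])
```

Under these assumptions, the `good` trajectory reaches both milestones because the model first resolves the dependency, while the `bad` trajectory reaches neither because the message tool is called before cellular service is enabled.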

