ToolSandbox: A Stateful, Conversational, Interactive Evaluation Benchmark for LLM Tool Use Capabilities

Abstract

Recent advances in large language models (LLMs) have sparked growing research interest in tool-assisted LLMs that solve real-world challenges, which calls for comprehensive evaluation of tool-use capabilities. Whereas previous work focused on evaluating over stateless web services (RESTful APIs), on a single-turn user prompt, or on an off-policy dialog trajectory, ToolSandbox includes stateful tool execution, implicit state dependencies between tools, a built-in user simulator supporting on-policy conversational evaluation, and a dynamic evaluation strategy for intermediate and final milestones over an arbitrary trajectory. We show that there is a significant performance gap between open-source and proprietary models, and that complex tasks such as State Dependency, Canonicalization, and Insufficient Information, as defined in ToolSandbox, are challenging even for the most capable SOTA LLMs, providing new insights into the tool-use capabilities of LLMs. The ToolSandbox evaluation framework is released at https://github.com/apple/ToolSandbox
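
To make "stateful tool execution", "implicit state dependencies between tools", and "milestones over an arbitrary trajectory" concrete, here is a minimal Python sketch. It is not the ToolSandbox API: WorldState, set_cellular, send_message, milestones_reached, and the placeholder phone number are hypothetical names invented for illustration, and the in-order milestone check is only an assumed simplification of the evaluation idea described in the abstract.

```python
import copy
from dataclasses import dataclass, field


@dataclass
class WorldState:
    """Hypothetical shared state that persists across tool calls."""
    cellular_on: bool = False
    sent_messages: list = field(default_factory=list)


def set_cellular(state: WorldState, on: bool) -> str:
    """Tool 1: mutates the shared state."""
    state.cellular_on = on
    return f"cellular set to {on}"


def send_message(state: WorldState, to: str, text: str) -> str:
    """Tool 2: implicitly depends on state written by set_cellular."""
    if not state.cellular_on:
        raise RuntimeError("cellular service is off")
    state.sent_messages.append((to, text))
    return "message sent"


def milestones_reached(snapshots, milestones) -> bool:
    """Return True if every milestone predicate is satisfied, in order,
    by the sequence of state snapshots. Unrelated steps may intervene."""
    idx = 0
    for snap in snapshots:
        # One snapshot may satisfy several consecutive milestones.
        while idx < len(milestones) and milestones[idx](snap):
            idx += 1
    return idx == len(milestones)


# A toy trajectory: the first attempt fails because of the hidden dependency.
state = WorldState()
snapshots = []
first_error = None

try:
    send_message(state, "+15550100", "hi")  # placeholder number
except RuntimeError as err:
    first_error = str(err)
snapshots.append(copy.deepcopy(state))

set_cellular(state, True)
snapshots.append(copy.deepcopy(state))

send_message(state, "+15550100", "hi")
snapshots.append(copy.deepcopy(state))

# Intermediate milestone (service enabled), then final milestone (message sent).
milestones = [
    lambda s: s.cellular_on,
    lambda s: len(s.sent_messages) == 1,
]
passed = milestones_reached(snapshots, milestones)
```

In this sketch the failed first call still counts as part of the trajectory. The check only asks whether the required states were eventually reached in the right order, which is one way to score a trajectory without fixing its exact steps in advance.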

