The Tool Decathlon: Benchmarking Language Agents for Diverse, Realistic, and Long-Horizon Task Execution
October 29, 2025
Authors: Junlong Li, Wenshuo Zhao, Jian Zhao, Weihao Zeng, Haoze Wu, Xiaochen Wang, Rui Ge, Yuxuan Cao, Yuzhen Huang, Wei Liu, Junteng Liu, Zhaochen Su, Yiyang Guo, Fan Zhou, Lueyang Zhang, Juan Michelini, Xingyao Wang, Xiang Yue, Shuyan Zhou, Graham Neubig, Junxian He
cs.AI
Abstract
Real-world language agents must handle complex, multi-step workflows across
diverse Apps. For instance, an agent may manage emails by coordinating with
calendars and file systems, or monitor a production database to detect
anomalies and generate reports following an operating manual. However, existing
language agent benchmarks often focus on narrow domains or simplified tasks
that lack the diversity, realism, and long-horizon complexity required to
evaluate agents' real-world performance. To address this gap, we introduce the
Tool Decathlon (dubbed Toolathlon), a benchmark for language agents offering
diverse Apps and tools, realistic environment setup, and reliable
execution-based evaluation. Toolathlon spans 32 software applications and 604
tools, ranging from everyday platforms such as Google Calendar and Notion to
professional ones like WooCommerce, Kubernetes, and BigQuery. Most of the tools
are built on a high-quality set of Model Context Protocol (MCP) servers that we
revised or implemented ourselves. Unlike prior work, which primarily
ensure functional realism but offer limited environment state diversity, we
provide realistic initial environment states from real software, such as Canvas
courses with dozens of students or real financial spreadsheets. This benchmark
includes 108 manually sourced or crafted tasks in total; completing a task
requires interacting with multiple Apps over roughly 20 turns on average. Each task is
strictly verifiable through dedicated evaluation scripts. Comprehensive
evaluation of state-of-the-art models highlights their significant shortcomings: the
best-performing model, Claude-4.5-Sonnet, achieves only a 38.6% success rate
with an average of 20.2 tool-calling turns, while the top open-weights model
DeepSeek-V3.2-Exp reaches 20.1%. We expect Toolathlon to drive the development
of more capable language agents for real-world, long-horizon task execution.
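To make the tool-calling setup concrete, below is a minimal sketch of how an agent can invoke a tool on an MCP server, using the official `mcp` Python SDK. The server command (`calendar_server.py`), tool name (`create_event`), and arguments are hypothetical placeholders, not Toolathlon's actual configuration.

```python
# Minimal sketch of an agent-side MCP tool call via the official `mcp`
# Python SDK. The server script, tool name, and arguments below are
# hypothetical placeholders, not Toolathlon's actual setup.
import asyncio

from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

async def main() -> None:
    # Launch a (hypothetical) MCP server as a subprocess over stdio.
    server = StdioServerParameters(command="python", args=["calendar_server.py"])
    async with stdio_client(server) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()

            # Discover the tools this server exposes; an agent would pass
            # these schemas to the language model as callable functions.
            tools = await session.list_tools()
            print([t.name for t in tools.tools])

            # Execute one tool call chosen by the model.
            result = await session.call_tool(
                "create_event",
                arguments={"title": "Standup", "start": "2025-10-29T09:00"},
            )
            print(result.content)

asyncio.run(main())
```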
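Likewise, "strictly verifiable through dedicated evaluation scripts" means grading the final environment state rather than the agent's transcript. The sketch below illustrates the idea; `Event`, `CalendarClient`, and the expected values are invented stand-ins for a real App's API, not the benchmark's actual checkers.

```python
# Hedged sketch of execution-based evaluation: the grader inspects the
# final environment state instead of the agent's dialogue. Event,
# CalendarClient, and the target values are hypothetical illustrations.
from dataclasses import dataclass
from typing import Optional

@dataclass
class Event:
    title: str
    start: str
    attendees: list[str]

class CalendarClient:
    """Stand-in for an App client; a real script would query the live API."""
    def __init__(self, events: list[Event]):
        self._events = events

    def fetch_event(self, title: str) -> Optional[Event]:
        return next((e for e in self._events if e.title == title), None)

def evaluate(client: CalendarClient) -> bool:
    """Pass iff the task's target state holds in the environment."""
    event = client.fetch_event("Standup")
    return (
        event is not None
        and event.start == "2025-10-29T09:00"
        and "alice@example.com" in event.attendees
    )

if __name__ == "__main__":
    client = CalendarClient(
        [Event("Standup", "2025-10-29T09:00", ["alice@example.com"])]
    )
    print("PASS" if evaluate(client) else "FAIL")
```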