

The Tool Decathlon: Benchmarking Language Agents for Diverse, Realistic, and Long-Horizon Task Execution

October 29, 2025
Authors: Junlong Li, Wenshuo Zhao, Jian Zhao, Weihao Zeng, Haoze Wu, Xiaochen Wang, Rui Ge, Yuxuan Cao, Yuzhen Huang, Wei Liu, Junteng Liu, Zhaochen Su, Yiyang Guo, Fan Zhou, Lueyang Zhang, Juan Michelini, Xingyao Wang, Xiang Yue, Shuyan Zhou, Graham Neubig, Junxian He
cs.AI

Abstract

Real-world language agents must handle complex, multi-step workflows across diverse Apps. For instance, an agent may manage emails by coordinating with calendars and file systems, or monitor a production database to detect anomalies and generate reports following an operating manual. However, existing language agent benchmarks often focus on narrow domains or simplified tasks that lack the diversity, realism, and long-horizon complexity required to evaluate agents' real-world performance. To address this gap, we introduce the Tool Decathlon (dubbed Toolathlon), a benchmark for language agents offering diverse Apps and tools, realistic environment setups, and reliable execution-based evaluation. Toolathlon spans 32 software applications and 604 tools, ranging from everyday platforms such as Google Calendar and Notion to professional ones like WooCommerce, Kubernetes, and BigQuery. Most of the tools are based on a high-quality set of Model Context Protocol (MCP) servers that we revised or implemented ourselves. Unlike prior works, which primarily ensure functional realism but offer limited environment state diversity, we provide realistic initial environment states drawn from real software, such as Canvas courses with dozens of students or real financial spreadsheets. The benchmark includes 108 manually sourced or crafted tasks in total, each requiring interaction with multiple Apps over roughly 20 turns on average to complete. Each task is strictly verifiable through a dedicated evaluation script. A comprehensive evaluation of SOTA models highlights their significant shortcomings: the best-performing model, Claude-4.5-Sonnet, achieves only a 38.6% success rate with 20.2 tool-calling turns on average, while the top open-weights model, DeepSeek-V3.2-Exp, reaches 20.1%. We expect Toolathlon to drive the development of more capable language agents for real-world, long-horizon task execution.
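A distinguishing claim of the abstract is that tasks are verified by execution: a per-task script inspects the final state of the environment rather than the agent's transcript. As an illustration only, the following Python sketch shows what such a checker could look like. The endpoint, the expected events, and the helper names (`STATE_URL`, `fetch_calendar_events`, `EXPECTED_EVENTS`) are hypothetical assumptions for this sketch, not the benchmark's actual API.

```python
"""Minimal sketch of an execution-based evaluation script, in the spirit of
Toolathlon's per-task checkers. All names below are hypothetical; the real
benchmark's scripts and environments may differ substantially."""

import json
import sys
import urllib.request

# Hypothetical endpoint exposing the final state of a sandboxed calendar app.
STATE_URL = "http://localhost:8080/calendar/events"

# Ground-truth events the task required the agent to create (hypothetical).
EXPECTED_EVENTS = {
    ("Project kickoff", "2025-11-03T09:00"),
    ("Budget review", "2025-11-05T14:00"),
}


def fetch_calendar_events(url: str) -> set[tuple[str, str]]:
    """Read the environment's final state, not the agent's self-reported log."""
    with urllib.request.urlopen(url) as resp:
        events = json.load(resp)
    return {(e["title"], e["start"]) for e in events}


def main() -> None:
    actual = fetch_calendar_events(STATE_URL)
    missing = EXPECTED_EVENTS - actual
    if missing:
        print(f"FAIL: missing events: {sorted(missing)}")
        sys.exit(1)
    print("PASS")


if __name__ == "__main__":
    main()
```

Grading against live application state in this way credits an agent only when the workflow's side effects actually occurred, which is what makes long-horizon, multi-App tasks strictly verifiable.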