TOBench: A Task-Oriented Omni-Modal Benchmark for Real-World Tool-Using Agents

Abstract

Tool-using agents are increasingly expected to operate across realistic professional workflows, where they must interpret multimodal inputs, coordinate external tools, inspect intermediate artifacts, and revise their actions before producing a final result. Existing benchmarks, however, often evaluate tool use, computer use, and multimodal reasoning in isolation, leaving a gap between benchmark settings and end-to-end omni-modal tool use in the real world. To address this gap, we introduce MM-ToolBench, a benchmark and evaluation harness for task-oriented omni-modal tool use. MM-ToolBench contains 100 executable tasks from two macro task families, Customer Service and Intelligent Creation, covering 20 subcategory slices and supported by 27 MCP servers with 324 tools. The central design of MM-ToolBench is closed-loop multimodal verification: agents must execute tools, inspect rendered or transformed artifacts, and self-correct when outputs fail task-specific requirements. To make such evaluation scalable and verifiable, MM-ToolBench couples MCP-based execution with task-specific grounded evaluators and a semi-automated construction pipeline for scenario discovery, task instantiation, evaluator synthesis, and human audit. Experiments on 15 contemporary agentic models show that MM-ToolBench remains highly challenging: Claude Opus 4.6, commonly regarded as one of the strongest coding-agent models, achieves only 32.0% task success, far below the 94.0% human baseline. We envision MM-ToolBench as a practical foundation for evaluating and advancing next-generation omni-modal tool-using agents through closed-loop multimodal verification.
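The closed-loop pattern described above (execute tools, inspect the resulting artifact, self-correct on failure) can be illustrated with a minimal Python sketch. This is not the MM-ToolBench harness: `closed_loop`, `Check`, `toy_act`, and `toy_inspect` are hypothetical names introduced only for illustration, and the "artifact" here is a plain string standing in for a rendered or transformed output.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass
class Check:
    """Result of inspecting an artifact (hypothetical structure)."""
    passed: bool
    feedback: str = ""


def closed_loop(
    act: Callable[[Optional[str]], str],
    inspect: Callable[[str], Check],
    max_rounds: int = 3,
) -> Tuple[str, bool, int]:
    """Repeat act -> inspect -> revise until the check passes or rounds run out."""
    feedback: Optional[str] = None
    artifact = ""
    for round_no in range(1, max_rounds + 1):
        artifact = act(feedback)      # execute tools and produce an artifact
        check = inspect(artifact)     # examine the produced artifact
        if check.passed:
            return artifact, True, round_no
        feedback = check.feedback     # feed the failure back for self-correction
    return artifact, False, max_rounds


def toy_act(feedback: Optional[str]) -> str:
    # Stand-in for real tool calls: the first draft omits the title.
    return "chart(title='Q3 sales')" if feedback else "chart()"


def toy_inspect(artifact: str) -> Check:
    # Stand-in for a task-specific requirement: the chart must have a title.
    ok = "title=" in artifact
    return Check(ok, "" if ok else "add a title")


result = closed_loop(toy_act, toy_inspect)
```

In this toy run the first draft fails inspection, the feedback is passed back to the acting step, and the second draft passes. A real agent would replace `toy_act` with model-driven tool calls and `toy_inspect` with a check over the actual rendered output.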