TOBench: A Task-Oriented Omni-Modal Benchmark for Real-World Tool-Using Agents
May 16, 2026
Authors: Zhiqiang Liu, Wenhui Dong, Yilang Tan, Yuwen Qu, Haochen Yin, Chenyang Si
cs.AI
Abstract
Tool-using agents are increasingly expected to operate across realistic professional workflows, where they must interpret multimodal inputs, coordinate external tools, inspect intermediate artifacts, and revise their actions before producing a final result. Existing benchmarks, however, often evaluate tool use, computer use, and multimodal reasoning in isolation, leaving a gap between benchmark settings and end-to-end omni-modal tool use in the real world. To address this gap, we introduce MM-ToolBench, a benchmark and evaluation harness for task-oriented omni-modal tool use. MM-ToolBench contains 100 executable tasks from two macro task families, Customer Service and Intelligent Creation, covering 20 subcategory slices and supported by 27 MCP servers with 324 tools. The central design of MM-ToolBench is closed-loop multimodal verification: agents must execute tools, inspect rendered or transformed artifacts, and self-correct when outputs fail task-specific requirements. To make such evaluation scalable and verifiable, MM-ToolBench couples MCP-based execution with task-specific grounded evaluators and a semi-automated construction pipeline for scenario discovery, task instantiation, evaluator synthesis, and human audit. Experiments on 15 contemporary agentic models show that MM-ToolBench remains highly challenging: Claude Opus 4.6, commonly regarded as one of the strongest coding-agent models, achieves only 32.0% task success, far below the 94.0% human benchmark. We envision MM-ToolBench as a practical foundation for evaluating and advancing next-generation omni-modal tool-using agents through closed-loop multimodal verification.
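The closed-loop verification described above (execute tools, inspect the resulting artifact, self-correct on failure) can be illustrated with a minimal sketch. This is not the benchmark's actual harness: `closed_loop_run`, `Verdict`, `task_success_rate`, and the toy `execute`/`inspect`/`revise` callables are all hypothetical names introduced here for illustration, and the width check is an invented stand-in for a task-specific evaluator.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Optional, Tuple


@dataclass
class Verdict:
    """Result of inspecting one artifact (hypothetical structure)."""
    passed: bool
    feedback: str = ""


def closed_loop_run(
    execute: Callable[[Any], Any],
    inspect: Callable[[Any], Verdict],
    revise: Callable[[Any, str], Any],
    plan: Any,
    max_rounds: int = 3,
) -> Tuple[Optional[Any], List[Verdict]]:
    """Execute a plan, inspect the artifact, and revise until it passes.

    Returns the accepted artifact (or None if no round passed) together
    with the verdict recorded in each round.
    """
    history: List[Verdict] = []
    for _ in range(max_rounds):
        artifact = execute(plan)          # e.g. call tools, render output
        verdict = inspect(artifact)       # check the rendered artifact
        history.append(verdict)
        if verdict.passed:
            return artifact, history
        plan = revise(plan, verdict.feedback)  # self-correct and retry
    return None, history


def task_success_rate(outcomes: List[bool]) -> float:
    """Percentage of tasks whose final artifact passed its evaluator."""
    return 100.0 * sum(outcomes) / len(outcomes)


# Toy usage: the task requires an artifact of width 800 (invented requirement).
execute = lambda plan: {"width": plan["width"]}
inspect = lambda art: Verdict(art["width"] == 800, "width must be 800")
revise = lambda plan, feedback: {**plan, "width": 800}

artifact, history = closed_loop_run(execute, inspect, revise, {"width": 640})
```

In this toy run the first artifact fails inspection, the plan is revised, and the second round passes. Under the same assumptions, a 32.0% task success rate on 100 tasks corresponds to 32 tasks whose final artifact passed its evaluator.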