TOBench: A Task-Oriented Omni-Modal Benchmark for Real-World Tool-Using Agents

Abstract

Tool-using agents are increasingly expected to operate across realistic professional workflows, where they must interpret multimodal inputs, coordinate external tools, inspect intermediate artifacts, and revise their actions before producing a final result. Existing benchmarks, however, often evaluate tool use, computer use, and multimodal reasoning in isolation, leaving a gap between benchmark settings and end-to-end omni-modal tool use in the real world. To address this gap, we introduce MM-ToolBench, a benchmark and evaluation harness for task-oriented omni-modal tool use. MM-ToolBench contains 100 executable tasks from two macro task families, Customer Service and Intelligent Creation, covering 20 subcategory slices and supported by 27 MCP servers with 324 tools. The central design of MM-ToolBench is closed-loop multimodal verification: agents must execute tools, inspect rendered or transformed artifacts, and self-correct when outputs fail task-specific requirements. To make such evaluation scalable and verifiable, MM-ToolBench couples MCP-based execution with task-specific grounded evaluators and a semi-automated construction pipeline for scenario discovery, task instantiation, evaluator synthesis, and human audit. Experiments on 15 contemporary agentic models show that MM-ToolBench remains highly challenging: Claude Opus 4.6, commonly regarded as one of the strongest coding-agent models, achieves only 32.0% task success, far below the 94.0% human benchmark. We envision MM-ToolBench as a practical foundation for evaluating and advancing next-generation omni-modal tool-using agents through closed-loop multimodal verification.
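The closed-loop verification idea described above (execute a tool, inspect the resulting artifact, revise on failure) can be illustrated with a minimal sketch. This is not the benchmark's harness: the names `closed_loop`, `Attempt`, `toy_execute`, and `toy_inspect`, the round budget, and the 20-character caption requirement are all hypothetical, chosen only to make the control flow concrete.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Attempt:
    """One round of the loop: the artifact produced and the inspection verdict."""
    artifact: str
    passed: bool
    feedback: str


def closed_loop(
    execute: Callable[[str], str],
    inspect: Callable[[str], Tuple[bool, str]],
    max_rounds: int = 3,
) -> List[Attempt]:
    """Run execute -> inspect -> revise until inspection passes or the budget ends."""
    history: List[Attempt] = []
    feedback = ""  # empty feedback means "first attempt, nothing to fix yet"
    for _ in range(max_rounds):
        artifact = execute(feedback)          # tool call, conditioned on prior feedback
        passed, feedback = inspect(artifact)  # check the produced artifact itself
        history.append(Attempt(artifact, passed, feedback))
        if passed:
            break
    return history


# Hypothetical toy task: produce a caption of at most 20 characters.
def toy_execute(feedback: str) -> str:
    draft = "Spring sale: up to 50% off everything"
    # Self-correction step: shorten the draft only once inspection has complained.
    return draft[:20] if feedback else draft


def toy_inspect(artifact: str) -> Tuple[bool, str]:
    if len(artifact) <= 20:
        return True, ""
    return False, "caption exceeds 20 characters"


history = closed_loop(toy_execute, toy_inspect)
for round_number, attempt in enumerate(history, start=1):
    print(round_number, attempt.passed, repr(attempt.artifact))
```

In this toy run the first attempt fails inspection and the second passes after revision. In a real setting the `execute` step would be a tool invocation and the `inspect` step would examine a rendered or transformed artifact rather than a string length.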