TOBench: A Task-Oriented Omni-Modal Benchmark for Real-World Tool-Using Agents

Abstract

Tool-using agents are increasingly expected to operate across realistic professional workflows, where they must interpret multimodal inputs, coordinate external tools, inspect intermediate artifacts, and revise their actions before producing a final result. Existing benchmarks, however, often evaluate tool use, computer use, and multimodal reasoning in isolation, leaving a gap between benchmark settings and end-to-end omni-modal tool use in the real world. To address this gap, we introduce MM-ToolBench, a benchmark and evaluation harness for task-oriented omni-modal tool use. MM-ToolBench contains 100 executable tasks from two macro task families, Customer Service and Intelligent Creation, covering 20 subcategory slices and supported by 27 MCP servers with 324 tools. The central design of MM-ToolBench is closed-loop multimodal verification: agents must execute tools, inspect rendered or transformed artifacts, and self-correct when outputs fail task-specific requirements. To make such evaluation scalable and verifiable, MM-ToolBench couples MCP-based execution with task-specific grounded evaluators and a semi-automated construction pipeline for scenario discovery, task instantiation, evaluator synthesis, and human audit. Experiments on 15 contemporary agentic models show that MM-ToolBench remains highly challenging: even Claude Opus 4.6, commonly regarded as one of the strongest coding-agent models, achieves only 32.0% task success, far below the 94.0% human baseline. We envision MM-ToolBench as a practical foundation for evaluating and advancing next-generation omni-modal tool-using agents through closed-loop multimodal verification.
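
As a rough illustration of the closed-loop verification idea described above, the following minimal Python sketch runs an execute, inspect, and revise cycle until a task-specific check passes or a round budget is exhausted. Every name in it (`closed_loop_run`, `render`, `check`, `shorten`, `task_success_rate`, and the 12-character limit) is a hypothetical stand-in chosen for this example; none of it reflects the benchmark's actual harness, MCP servers, or evaluators.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Attempt:
    artifact: str   # what the tool produced in this round
    passed: bool    # whether the task-specific check accepted it
    feedback: str   # evaluator message used to revise the plan


def closed_loop_run(
    execute: Callable[[str], str],
    verify: Callable[[str], Tuple[bool, str]],
    revise: Callable[[str, str], str],
    plan: str,
    max_rounds: int = 3,
) -> List[Attempt]:
    """Execute a plan, inspect the artifact, and revise until it passes."""
    history: List[Attempt] = []
    for _ in range(max_rounds):
        artifact = execute(plan)              # run the (hypothetical) tool
        passed, feedback = verify(artifact)   # inspect the produced artifact
        history.append(Attempt(artifact, passed, feedback))
        if passed:
            break
        plan = revise(plan, feedback)         # self-correct and try again
    return history


def task_success_rate(outcomes: List[bool]) -> float:
    """Percentage of tasks whose final attempt passed."""
    return 100.0 * sum(outcomes) / len(outcomes)


# Toy stand-ins (assumptions): the "tool" renders a banner in upper case,
# and the task requires the rendered banner to fit in 12 characters.
def render(plan: str) -> str:
    return plan.upper()


def check(artifact: str) -> Tuple[bool, str]:
    if len(artifact) <= 12:
        return True, ""
    return False, "banner too long: shorten to 12 characters"


def shorten(plan: str, feedback: str) -> str:
    return plan[:12]


if __name__ == "__main__":
    for attempt in closed_loop_run(render, check, shorten, "Quarterly sales banner"):
        print(attempt)
    # 32 passing tasks out of 100 corresponds to the reported 32.0% success rate.
    print(task_success_rate([True] * 32 + [False] * 68))
```

In this toy run the first rendered banner fails the length check, the plan is shortened, and the second attempt passes. The point of the sketch is only that success is judged on the inspected artifact, not on whether the tool call returned without error.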