M3-Bench: Multi-Modal, Multi-Hop, Multi-Threaded Tool-Using MLLM Agent Benchmark
November 21, 2025
Authors: Yang Zhou, Mingyu Zhao, Zhenting Wang, Difei Gu, Bangwei Guo, Ruosong Ye, Ligong Han, Can Jin, Dimitris N. Metaxas
cs.AI
Abstract
We present M^3-Bench, the first benchmark for evaluating multimodal tool use under the Model Context Protocol (MCP). The benchmark targets realistic multi-hop, multi-threaded workflows that require visual grounding and textual reasoning, cross-tool dependencies, and persistence of intermediate resources across steps. We introduce a similarity-driven alignment method that serializes each tool call, embeds its signature with a sentence encoder, and performs similarity-bucketed Hungarian matching to obtain auditable one-to-one correspondences. On top of this alignment, we report interpretable metrics that decouple semantic fidelity from workflow consistency. The benchmark spans 28 servers with 231 tools and provides standardized trajectories curated through an Executor & Judge pipeline with human verification; an auxiliary judge ensemble of four large language models (LLMs) reports Task Completion and information grounding at the end-task level. Evaluations of representative state-of-the-art multimodal LLMs (MLLMs) reveal persistent gaps in multimodal MCP tool use, particularly in argument fidelity and structural consistency, underscoring the need for methods that jointly reason over images, text, and tool graphs. Our benchmark's anonymous repository is at https://github.com/EtaYang10th/Open-M3-Bench.
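
Since the alignment procedure is only summarized here, the sketch below illustrates one plausible reading of it in Python: tool calls are serialized into string signatures, embedded with a sentence encoder, and matched one-to-one with the Hungarian algorithm, with a plain similarity threshold standing in for the paper's similarity bucketing. The encoder checkpoint, function names, and threshold are illustrative assumptions, not the benchmark's actual implementation.

```python
# Minimal sketch of similarity-driven alignment between predicted and
# reference tool calls. Assumes sentence-transformers and scipy are
# installed; model choice, helpers, and threshold are hypothetical.
import json
from scipy.optimize import linear_sum_assignment
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed encoder

def serialize(call: dict) -> str:
    # Flatten a tool call (name + arguments) into a single string signature.
    return f"{call['name']}({json.dumps(call.get('arguments', {}), sort_keys=True)})"

def align(pred_calls: list[dict], ref_calls: list[dict], threshold: float = 0.5):
    # Embed serialized signatures; normalized embeddings make the dot
    # product a cosine similarity.
    pred_emb = encoder.encode([serialize(c) for c in pred_calls],
                              normalize_embeddings=True)
    ref_emb = encoder.encode([serialize(c) for c in ref_calls],
                             normalize_embeddings=True)
    sim = pred_emb @ ref_emb.T

    # Hungarian matching maximizes total similarity (minimize its negation).
    rows, cols = linear_sum_assignment(-sim)

    # Keep only pairs above the threshold (a stand-in for similarity
    # bucketing), yielding auditable one-to-one correspondences.
    return [(int(i), int(j), float(sim[i, j]))
            for i, j in zip(rows, cols) if sim[i, j] >= threshold]
```

The resulting (predicted index, reference index, similarity) triples could then feed the kind of interpretable metrics the abstract describes, e.g. scoring argument fidelity on matched pairs separately from ordering or coverage of the workflow.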