MM-BrowseComp: A Comprehensive Benchmark for Multimodal Browsing Agents
August 14, 2025
Authors: Shilong Li, Xingyuan Bu, Wenjie Wang, Jiaheng Liu, Jun Dong, Haoyang He, Hao Lu, Haozhe Zhang, Chenchen Jing, Zhen Li, Chuanhao Li, Jiayi Tian, Chenchen Zhang, Tianhao Peng, Yancheng He, Jihao Gu, Yuanxing Zhang, Jian Yang, Ge Zhang, Wenhao Huang, Wangchunshu Zhou, Zhaoxiang Zhang, Ruizhe Ding, Shilei Wen
cs.AI
Abstract
AI agents with advanced reasoning and tool use capabilities have demonstrated
impressive performance in web browsing for deep search. While existing
benchmarks such as BrowseComp evaluate these browsing abilities, they primarily
focus on textual information, overlooking the prevalence of multimodal content.
To bridge this gap, we introduce MM-BrowseComp, a novel benchmark comprising
224 challenging, hand-crafted questions specifically designed to assess agents'
multimodal retrieval and reasoning capabilities. These questions often
incorporate images in prompts, and crucial information encountered during the
search and reasoning process may also be embedded within images or videos on
webpages. Consequently, methods relying solely on text prove insufficient for
our benchmark. Additionally, we provide a verified checklist for each question,
enabling fine-grained analysis of multimodal dependencies and reasoning paths.
Our comprehensive evaluation of state-of-the-art models on MM-BrowseComp
reveals that even top models like OpenAI o3 with tools achieve only 29.02%
accuracy, highlighting the suboptimal multimodal capabilities and lack of
native multimodal reasoning in current models.