MM-BrowseComp: A Comprehensive Benchmark for Multimodal Browsing Agents
August 14, 2025
Authors: Shilong Li, Xingyuan Bu, Wenjie Wang, Jiaheng Liu, Jun Dong, Haoyang He, Hao Lu, Haozhe Zhang, Chenchen Jing, Zhen Li, Chuanhao Li, Jiayi Tian, Chenchen Zhang, Tianhao Peng, Yancheng He, Jihao Gu, Yuanxing Zhang, Jian Yang, Ge Zhang, Wenhao Huang, Wangchunshu Zhou, Zhaoxiang Zhang, Ruizhe Ding, Shilei Wen
cs.AI
Abstract
AI agents with advanced reasoning and tool use capabilities have demonstrated
impressive performance in web browsing for deep search. While existing
benchmarks such as BrowseComp evaluate these browsing abilities, they primarily
focus on textual information, overlooking the prevalence of multimodal content.
To bridge this gap, we introduce MM-BrowseComp, a novel benchmark comprising
224 challenging, hand-crafted questions specifically designed to assess agents'
multimodal retrieval and reasoning capabilities. These questions often
incorporate images in prompts, and crucial information encountered during the
search and reasoning process may also be embedded within images or videos on
webpages. Consequently, methods relying solely on text prove insufficient for
our benchmark. Additionally, we provide a verified checklist for each question,
enabling fine-grained analysis of multimodal dependencies and reasoning paths.
Our comprehensive evaluation of state-of-the-art models on MM-BrowseComp
reveals that even top models like OpenAI o3 with tools achieve only 29.02%
accuracy, highlighting the suboptimal multimodal capabilities and lack of
native multimodal reasoning in current models.