MM-BrowseComp: A Comprehensive Benchmark for Multimodal Browsing Agents
August 14, 2025
Authors: Shilong Li, Xingyuan Bu, Wenjie Wang, Jiaheng Liu, Jun Dong, Haoyang He, Hao Lu, Haozhe Zhang, Chenchen Jing, Zhen Li, Chuanhao Li, Jiayi Tian, Chenchen Zhang, Tianhao Peng, Yancheng He, Jihao Gu, Yuanxing Zhang, Jian Yang, Ge Zhang, Wenhao Huang, Wangchunshu Zhou, Zhaoxiang Zhang, Ruizhe Ding, Shilei Wen
cs.AI
Abstract
AI agents with advanced reasoning and tool use capabilities have demonstrated
impressive performance in web browsing for deep search. While existing
benchmarks such as BrowseComp evaluate these browsing abilities, they primarily
focus on textual information, overlooking the prevalence of multimodal content.
To bridge this gap, we introduce MM-BrowseComp, a novel benchmark comprising
224 challenging, hand-crafted questions specifically designed to assess agents'
multimodal retrieval and reasoning capabilities. These questions often
incorporate images in prompts, and crucial information encountered during the
search and reasoning process may also be embedded within images or videos on
webpages. Consequently, methods relying solely on text prove insufficient for
our benchmark. Additionally, we provide a verified checklist for each question,
enabling fine-grained analysis of multimodal dependencies and reasoning paths.
Our comprehensive evaluation of state-of-the-art models on MM-BrowseComp
reveals that even top models like OpenAI o3 with tools achieve only 29.02%
accuracy, highlighting the suboptimal multimodal capabilities and lack of
native multimodal reasoning in current models.
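As a rough illustration of how a per-question verified checklist could support the fine-grained analysis described above, here is a minimal sketch of checklist-based scoring. The record fields, the exact-match comparison, and the `grade` helper are illustrative assumptions, not the authors' actual schema or grading protocol:

```python
from dataclasses import dataclass, field

@dataclass
class Question:
    """Hypothetical MM-BrowseComp record: a prompt (possibly with images),
    a gold final answer, and a verified checklist of intermediate findings
    the agent must uncover along its search/reasoning path."""
    prompt: str
    gold_answer: str
    checklist: list[str] = field(default_factory=list)

def grade(question: Question, answer: str, found_facts: set[str]) -> dict:
    """Score one agent run: final-answer accuracy plus checklist recall
    over the verified intermediate steps (illustrative metric only)."""
    correct = answer.strip().lower() == question.gold_answer.strip().lower()
    hits = sum(1 for item in question.checklist if item in found_facts)
    recall = hits / len(question.checklist) if question.checklist else 0.0
    return {"correct": correct, "checklist_recall": recall}
```

Scoring each question this way would distinguish agents that fail outright from those that recover most multimodal evidence on the checklist but still miss the final answer.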