MM-BrowseComp: A Comprehensive Benchmark for Multimodal Browsing Agents
August 14, 2025
Authors: Shilong Li, Xingyuan Bu, Wenjie Wang, Jiaheng Liu, Jun Dong, Haoyang He, Hao Lu, Haozhe Zhang, Chenchen Jing, Zhen Li, Chuanhao Li, Jiayi Tian, Chenchen Zhang, Tianhao Peng, Yancheng He, Jihao Gu, Yuanxing Zhang, Jian Yang, Ge Zhang, Wenhao Huang, Wangchunshu Zhou, Zhaoxiang Zhang, Ruizhe Ding, Shilei Wen
cs.AI
Abstract
AI agents with advanced reasoning and tool use capabilities have demonstrated
impressive performance in web browsing for deep search. While existing
benchmarks such as BrowseComp evaluate these browsing abilities, they primarily
focus on textual information, overlooking the prevalence of multimodal content.
To bridge this gap, we introduce MM-BrowseComp, a novel benchmark comprising
224 challenging, hand-crafted questions specifically designed to assess agents'
multimodal retrieval and reasoning capabilities. These questions often
incorporate images in prompts, and crucial information encountered during the
search and reasoning process may also be embedded within images or videos on
webpages. Consequently, methods relying solely on text prove insufficient for
our benchmark. Additionally, we provide a verified checklist for each question,
enabling fine-grained analysis of multimodal dependencies and reasoning paths.
Our comprehensive evaluation of state-of-the-art models on MM-BrowseComp
reveals that even top models like OpenAI o3 with tools achieve only 29.02%
accuracy, highlighting the suboptimal multimodal capabilities and lack of
native multimodal reasoning in current models.
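As a rough illustration of how a per-question verified checklist could support the fine-grained analysis described above, here is a minimal sketch of checklist-based scoring. The record fields, the exact-match comparison, and the `grade` helper are illustrative assumptions, not the authors' actual schema or grading protocol:

```python
from dataclasses import dataclass, field

@dataclass
class Question:
    """Hypothetical MM-BrowseComp record: a prompt (possibly with images),
    a gold final answer, and a verified checklist of intermediate findings
    the agent must uncover along its search/reasoning path."""
    prompt: str
    gold_answer: str
    checklist: list[str] = field(default_factory=list)

def grade(question: Question, answer: str, found_facts: set[str]) -> dict:
    """Score one agent run: final-answer accuracy plus checklist recall
    over the verified intermediate steps (illustrative metric only)."""
    correct = answer.strip().lower() == question.gold_answer.strip().lower()
    hits = sum(1 for item in question.checklist if item in found_facts)
    recall = hits / len(question.checklist) if question.checklist else 0.0
    return {"correct": correct, "checklist_recall": recall}
```

Scoring each question this way would distinguish agents that fail outright from those that recover most multimodal evidence on the checklist but still miss the final answer.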