BrowseComp-V^3: A Visual, Vertical, and Verifiable Benchmark for Multimodal Browsing Agents
February 13, 2026
Authors: Huanyao Zhang, Jiepeng Zhou, Bo Li, Bowen Zhou, Yanzhe Dan, Haishan Lu, Zhiyong Cao, Jiaoyang Chen, Yuqian Han, Zinan Sheng, Zhengwei Tao, Hao Liang, Jialong Wu, Yang Shi, Yuanpeng He, Jiaye Lin, Qintong Zhang, Guochen Yan, Runhao Zhao, Zhengpin Li, Xiaohan Yu, Lang Mei, Chong Chen, Wentao Zhang, Bin Cui
cs.AI
Abstract
Multimodal large language models (MLLMs), equipped with increasingly advanced planning and tool-use capabilities, are evolving into autonomous agents capable of performing multimodal web browsing and deep search in open-world environments. However, existing benchmarks for multimodal browsing remain limited in task complexity, evidence accessibility, and evaluation granularity, hindering comprehensive and reproducible assessments of deep search capabilities. To address these limitations, we introduce BrowseComp-V^3, a novel benchmark consisting of 300 carefully curated and challenging questions spanning diverse domains. The benchmark emphasizes deep, multi-level, and cross-modal multi-hop reasoning, where critical evidence is interleaved across textual and visual modalities within and across web pages. All supporting evidence is strictly required to be publicly searchable, ensuring fairness and reproducibility. Beyond final-answer accuracy, we incorporate an expert-validated, subgoal-driven process evaluation mechanism that enables fine-grained analysis of intermediate reasoning behaviors and systematic characterization of capability boundaries. In addition, we propose OmniSeeker, a unified multimodal browsing agent framework integrating diverse web search and visual perception tools. Comprehensive experiments demonstrate that even state-of-the-art models achieve only 36% accuracy on our benchmark, revealing critical bottlenecks in multimodal information integration and fine-grained perception. Our results highlight a fundamental gap between current model capabilities and robust multimodal deep search in real-world settings.
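To make the "subgoal-driven process evaluation" concrete, the following is a minimal illustrative sketch only, not the paper's actual protocol: the names `Subgoal`, `EvalResult`, and `evaluate_trajectory`, the exact-match answer check, and the unweighted coverage score are all assumptions introduced here for illustration.

```python
# Illustrative sketch: combine final-answer accuracy with coverage of
# expert-validated intermediate subgoals. All names and the scoring rule
# below are hypothetical, not taken from the BrowseComp-V^3 paper.
from dataclasses import dataclass

@dataclass
class Subgoal:
    description: str   # expert-written intermediate evidence the agent should reach
    satisfied: bool    # whether the agent's trajectory covered this subgoal

@dataclass
class EvalResult:
    answer_correct: bool
    subgoal_coverage: float  # fraction of subgoals the trajectory satisfied

def evaluate_trajectory(final_answer: str, gold_answer: str,
                        subgoals: list[Subgoal]) -> EvalResult:
    """Score one agent run: final-answer check plus subgoal-level process score."""
    answer_correct = final_answer.strip().lower() == gold_answer.strip().lower()
    coverage = (sum(sg.satisfied for sg in subgoals) / len(subgoals)) if subgoals else 0.0
    return EvalResult(answer_correct=answer_correct, subgoal_coverage=coverage)

# Example usage with dummy data:
subgoals = [
    Subgoal("Identify the landmark shown in the query image", satisfied=True),
    Subgoal("Locate the archival record naming its architect", satisfied=False),
]
print(evaluate_trajectory("1889", "1889", subgoals))
```

Such a two-part score is one natural way to separate "reached the right answer" from "followed a verifiable reasoning path", which is the distinction the abstract's process-evaluation mechanism targets.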