BrowseComp-V^3: A Visual, Vertical, and Verifiable Benchmark for Multimodal Browsing Agents
February 13, 2026
Authors: Huanyao Zhang, Jiepeng Zhou, Bo Li, Bowen Zhou, Yanzhe Dan, Haishan Lu, Zhiyong Cao, Jiaoyang Chen, Yuqian Han, Zinan Sheng, Zhengwei Tao, Hao Liang, Jialong Wu, Yang Shi, Yuanpeng He, Jiaye Lin, Qintong Zhang, Guochen Yan, Runhao Zhao, Zhengpin Li, Xiaohan Yu, Lang Mei, Chong Chen, Wentao Zhang, Bin Cui
cs.AI
Abstract
Multimodal large language models (MLLMs), equipped with increasingly advanced planning and tool-use capabilities, are evolving into autonomous agents capable of performing multimodal web browsing and deep search in open-world environments. However, existing benchmarks for multimodal browsing remain limited in task complexity, evidence accessibility, and evaluation granularity, hindering comprehensive and reproducible assessments of deep search capabilities. To address these limitations, we introduce BrowseComp-V^3, a novel benchmark consisting of 300 carefully curated and challenging questions spanning diverse domains. The benchmark emphasizes deep, multi-level, and cross-modal multi-hop reasoning, where critical evidence is interleaved across textual and visual modalities within and across web pages. All supporting evidence is strictly required to be publicly searchable, ensuring fairness and reproducibility. Beyond final-answer accuracy, we incorporate an expert-validated, subgoal-driven process evaluation mechanism that enables fine-grained analysis of intermediate reasoning behaviors and systematic characterization of capability boundaries. In addition, we propose OmniSeeker, a unified multimodal browsing agent framework integrating diverse web search and visual perception tools. Comprehensive experiments demonstrate that even state-of-the-art models achieve only 36% accuracy on our benchmark, revealing critical bottlenecks in multimodal information integration and fine-grained perception. Our results highlight a fundamental gap between current model capabilities and robust multimodal deep search in real-world settings.
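To make the subgoal-driven process evaluation concrete, below is a minimal, hypothetical Python sketch, not the paper's released evaluator: it assumes each benchmark item carries expert-annotated subgoals and scores an agent trajectory on both final-answer accuracy and the fraction of subgoals covered by the evidence the agent collected. The names BenchmarkItem and AgentTrajectory and the simple substring check are illustrative placeholders; a real evaluator would likely use an LLM judge or semantic matching for subgoal completion.

    # Hypothetical sketch of subgoal-driven process evaluation (illustrative only).
    from dataclasses import dataclass, field


    @dataclass
    class BenchmarkItem:
        question: str
        gold_answer: str
        subgoals: list[str]  # expert-validated intermediate facts the agent should establish


    @dataclass
    class AgentTrajectory:
        final_answer: str
        evidence: list[str] = field(default_factory=list)  # snippets/URLs gathered while browsing


    def evaluate(item: BenchmarkItem, run: AgentTrajectory) -> dict:
        """Return final-answer correctness plus a subgoal completion rate."""
        answer_correct = run.final_answer.strip().lower() == item.gold_answer.strip().lower()
        # A subgoal counts as completed if any collected evidence mentions it;
        # this naive substring check stands in for a stronger judging step.
        completed = sum(
            any(sg.lower() in ev.lower() for ev in run.evidence) for sg in item.subgoals
        )
        return {
            "answer_correct": answer_correct,
            "subgoal_completion": completed / max(len(item.subgoals), 1),
        }


    if __name__ == "__main__":
        item = BenchmarkItem(
            question="Which museum holds the painting shown in the cropped image?",
            gold_answer="Rijksmuseum",
            subgoals=["identify the painting from the image", "Rijksmuseum"],
        )
        run = AgentTrajectory(
            final_answer="Rijksmuseum",
            evidence=["... the work is housed in the Rijksmuseum in Amsterdam ..."],
        )
        print(evaluate(item, run))  # correct answer, but only one of two subgoals matched

Reporting subgoal completion alongside final-answer accuracy is what enables the fine-grained analysis of intermediate reasoning behaviors described above: an agent can be credited for partially correct browsing even when its final answer is wrong, and lucky guesses with weak evidence trails become visible.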