

Vision-DeepResearch Benchmark: Rethinking Visual and Textual Search for Multimodal Large Language Models

February 2, 2026
Authors: Yu Zeng, Wenxuan Huang, Zhen Fang, Shuang Chen, Yufan Shen, Yishuo Cai, Xiaoman Wang, Zhenfei Yin, Lin Chen, Zehui Chen, Shiting Huang, Yiming Zhao, Yao Hu, Philip Torr, Wanli Ouyang, Shaosheng Cao
cs.AI

Abstract

Multimodal Large Language Models (MLLMs) have advanced visual question answering (VQA) and now support Vision-DeepResearch systems that use search engines for complex visual-textual fact-finding. However, evaluating these visual and textual search abilities remains difficult, and existing benchmarks have two major limitations. First, they are not visual search-centric: answers that should require visual search are often leaked through cross-textual cues in the text questions or can be inferred from the prior world knowledge of current MLLMs. Second, their evaluation scenarios are overly idealized: on the image-search side, the required information can often be obtained via near-exact matching against the full image, while the text-search side is overly direct and insufficiently challenging. To address these issues, we construct the Vision-DeepResearch benchmark (VDR-Bench), comprising 2,000 VQA instances. All questions are created via a careful, multi-stage curation pipeline and rigorous expert review, and are designed to assess the behavior of Vision-DeepResearch systems under real-world conditions. Moreover, to address the insufficient visual retrieval capabilities of current MLLMs, we propose a simple multi-round cropped-search workflow. This strategy is shown to effectively improve model performance in realistic visual retrieval scenarios. Overall, our results provide practical guidance for the design of future multimodal deep-research systems. The code will be released at https://github.com/Osilly/Vision-DeepResearch.
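The abstract does not detail how the multi-round cropped-search workflow operates. The sketch below is a minimal, hypothetical illustration of the general idea (iteratively cropping sub-regions of the image and searching on them, rather than near-exact matching against the full image); `propose_crop` and `image_search` are placeholder interfaces standing in for an MLLM call and a visual search engine, and are not part of the released code.

```python
# Hypothetical sketch of a multi-round cropped-search loop. The paper's actual
# workflow is not specified in the abstract; `propose_crop` and `image_search`
# are placeholders for an MLLM call and a visual search backend, respectively.
from dataclasses import dataclass, field


@dataclass
class SearchState:
    question: str
    evidence: list[str] = field(default_factory=list)


def propose_crop(image, question, evidence):
    """Placeholder: ask the MLLM for the next region (x, y, w, h) worth
    searching, or None if it can already answer from the gathered evidence."""
    raise NotImplementedError


def image_search(crop):
    """Placeholder: query a visual search engine with the cropped region and
    return a list of textual snippets from the retrieved pages."""
    raise NotImplementedError


def multi_round_cropped_search(image, question, max_rounds=5):
    """Iteratively crop and search instead of matching the full image once."""
    state = SearchState(question=question)
    for _ in range(max_rounds):
        region = propose_crop(image, question, state.evidence)
        if region is None:  # the model decides it has enough evidence
            break
        x, y, w, h = region
        crop = image.crop((x, y, x + w, y + h))  # PIL-style crop of the region
        state.evidence.extend(image_search(crop))  # search on the sub-region
    return state.evidence
```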