
Vision-DeepResearch Benchmark: Rethinking Visual and Textual Search for Multimodal Large Language Models

February 2, 2026
Authors: Yu Zeng, Wenxuan Huang, Zhen Fang, Shuang Chen, Yufan Shen, Yishuo Cai, Xiaoman Wang, Zhenfei Yin, Lin Chen, Zehui Chen, Shiting Huang, Yiming Zhao, Yao Hu, Philip Torr, Wanli Ouyang, Shaosheng Cao
cs.AI

Abstract

Multimodal Large Language Models (MLLMs) have advanced visual question answering (VQA) and now support Vision-DeepResearch systems that use search engines for complex visual-textual fact-finding. However, evaluating these visual and textual search abilities remains difficult, and existing benchmarks have two major limitations. First, they are not visual-search-centric: answers that should require visual search are often leaked through cross-textual cues in the text questions or can be inferred from the prior world knowledge of current MLLMs. Second, their evaluation scenarios are overly idealized: on the image-search side, the required information can often be obtained via near-exact matching against the full image, while the text-search side is overly direct and insufficiently challenging. To address these issues, we construct the Vision-DeepResearch benchmark (VDR-Bench), comprising 2,000 VQA instances. All questions are created via a careful, multi-stage curation pipeline with rigorous expert review, and are designed to assess the behavior of Vision-DeepResearch systems under realistic, real-world conditions. Moreover, to address the insufficient visual retrieval capabilities of current MLLMs, we propose a simple multi-round cropped-search workflow, which we show effectively improves model performance in realistic visual retrieval scenarios. Overall, our results provide practical guidance for the design of future multimodal deep-research systems. The code will be released at https://github.com/Osilly/Vision-DeepResearch.
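
The abstract does not detail the multi-round cropped-search workflow; the sketch below is only an illustration of the general idea under stated assumptions: an MLLM proposes image crops, each crop is sent to a visual search engine, and the loop stops once the model can answer. The functions `propose_crop`, `image_search`, and `answer_or_continue` are hypothetical placeholders, not APIs from the released code.

```python
# Minimal sketch of a multi-round cropped-search loop (illustrative only,
# not the paper's actual implementation).

from dataclasses import dataclass


@dataclass
class Crop:
    left: int
    top: int
    right: int
    bottom: int


def propose_crop(image, question, history):
    """Hypothetical MLLM call: pick an image region likely to contain
    the visual evidence needed to answer the question."""
    raise NotImplementedError


def image_search(region):
    """Hypothetical visual-search call: return text snippets retrieved
    for the cropped region."""
    raise NotImplementedError


def answer_or_continue(question, history):
    """Hypothetical MLLM call: return a final answer string, or None if
    another search round is needed."""
    raise NotImplementedError


def multi_round_cropped_search(image, question, max_rounds=5):
    """Iteratively crop the query image and search each crop, rather than
    matching the full image once, until the model can answer."""
    history = []
    answer = None
    for _ in range(max_rounds):
        crop = propose_crop(image, question, history)
        # `image` is assumed to be a PIL.Image, whose crop() takes a
        # (left, top, right, bottom) box.
        region = image.crop((crop.left, crop.top, crop.right, crop.bottom))
        snippets = image_search(region)
        history.append({"crop": crop, "snippets": snippets})
        answer = answer_or_continue(question, history)
        if answer is not None:
            return answer
    return answer
```

The key design choice this sketch captures is replacing a single whole-image lookup with several targeted, region-level searches, which is what the abstract contrasts with the "near-exact matching against the full image" setting of existing benchmarks.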