
Vision-DeepResearch Benchmark: Rethinking Visual and Textual Search for Multimodal Large Language Models

February 2, 2026
Authors: Yu Zeng, Wenxuan Huang, Zhen Fang, Shuang Chen, Yufan Shen, Yishuo Cai, Xiaoman Wang, Zhenfei Yin, Lin Chen, Zehui Chen, Shiting Huang, Yiming Zhao, Yao Hu, Philip Torr, Wanli Ouyang, Shaosheng Cao
cs.AI

Abstract

Multimodal Large Language Models (MLLMs) have advanced visual question answering (VQA) and now support Vision-DeepResearch systems that use search engines for complex visual-textual fact-finding. However, evaluating these visual and textual search abilities remains difficult, and existing benchmarks have two major limitations. First, they are not visual-search-centric: answers that should require visual search are often leaked through cross-textual cues in the text questions or can be inferred from the prior world knowledge of current MLLMs. Second, their evaluation scenarios are overly idealized: on the image-search side, the required information can often be obtained via near-exact matching against the full image, while the text-search side is overly direct and insufficiently challenging. To address these issues, we construct the Vision-DeepResearch benchmark (VDR-Bench), comprising 2,000 VQA instances. All questions are created via a careful, multi-stage curation pipeline with rigorous expert review, and are designed to assess the behavior of Vision-DeepResearch systems under realistic, real-world conditions. Moreover, to address the insufficient visual retrieval capabilities of current MLLMs, we propose a simple multi-round cropped-search workflow, which we show effectively improves model performance in realistic visual retrieval scenarios. Overall, our results provide practical guidance for the design of future multimodal deep-research systems. The code will be released at https://github.com/Osilly/Vision-DeepResearch.
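
The abstract does not detail the multi-round cropped-search workflow; the sketch below is only an illustration of the general idea under stated assumptions: an MLLM proposes image crops, each crop is sent to a visual search engine, and the loop stops once the model can answer. The functions `propose_crop`, `image_search`, and `answer_or_continue` are hypothetical placeholders, not APIs from the released code.

```python
# Minimal sketch of a multi-round cropped-search loop (illustrative only,
# not the paper's actual implementation).

from dataclasses import dataclass


@dataclass
class Crop:
    left: int
    top: int
    right: int
    bottom: int


def propose_crop(image, question, history):
    """Hypothetical MLLM call: pick an image region likely to contain
    the visual evidence needed to answer the question."""
    raise NotImplementedError


def image_search(region):
    """Hypothetical visual-search call: return text snippets retrieved
    for the cropped region."""
    raise NotImplementedError


def answer_or_continue(question, history):
    """Hypothetical MLLM call: return a final answer string, or None if
    another search round is needed."""
    raise NotImplementedError


def multi_round_cropped_search(image, question, max_rounds=5):
    """Iteratively crop the query image and search each crop, rather than
    matching the full image once, until the model can answer."""
    history = []
    answer = None
    for _ in range(max_rounds):
        crop = propose_crop(image, question, history)
        # `image` is assumed to be a PIL.Image, whose crop() takes a
        # (left, top, right, bottom) box.
        region = image.crop((crop.left, crop.top, crop.right, crop.bottom))
        snippets = image_search(region)
        history.append({"crop": crop, "snippets": snippets})
        answer = answer_or_continue(question, history)
        if answer is not None:
            return answer
    return answer
```

The key design choice this sketch captures is replacing a single whole-image lookup with several targeted, region-level searches, which is what the abstract contrasts with the "near-exact matching against the full image" setting of existing benchmarks.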