Vision-DeepResearch Benchmark: Rethinking Visual and Textual Search for Multimodal Large Language Models

Abstract

Multimodal Large Language Models (MLLMs) have advanced visual question answering (VQA) and now support Vision-DeepResearch systems that use search engines for complex visual-textual fact-finding. However, evaluating these visual and textual search abilities remains difficult, and existing benchmarks have two major limitations. First, they are not visual-search-centric: answers that should require visual search are often leaked through cross-textual cues in the text questions or can be inferred from the prior world knowledge of current MLLMs. Second, their evaluation scenarios are overly idealized: on the image-search side, the required information can often be obtained via near-exact matching against the full image, while on the text-search side the questions are overly direct and insufficiently challenging. To address these issues, we construct the Vision-DeepResearch benchmark (VDR-Bench), comprising 2,000 VQA instances. All questions are created through a careful multi-stage curation pipeline and rigorous expert review, and are designed to assess the behavior of Vision-DeepResearch systems under realistic conditions. Moreover, to address the insufficient visual retrieval capabilities of current MLLMs, we propose a simple multi-round cropped-search workflow, which we show effectively improves model performance in realistic visual retrieval scenarios. Overall, our results provide practical guidance for the design of future multimodal deep-research systems. The code will be released at https://github.com/Osilly/Vision-DeepResearch.
