Vision-DeepResearch Benchmark: Rethinking Visual and Textual Search for Multimodal Large Language Models

Abstract

Multimodal Large Language Models (MLLMs) have advanced Visual Question Answering (VQA) and now support Vision-DeepResearch systems that use search engines for complex visual-textual fact-finding. However, evaluating these visual and textual search abilities remains difficult, and existing benchmarks have two major limitations. First, they are not visual-search-centric: answers that should require visual search are often leaked through cross-textual cues in the text questions, or can be inferred from the prior world knowledge of current MLLMs. Second, their evaluation scenarios are overly idealized: on the image-search side, the required information can often be obtained via near-exact matching against the full image, while the text-search side is overly direct and insufficiently challenging. To address these issues, we construct the Vision-DeepResearch benchmark (VDR-Bench), comprising 2,000 VQA instances. All questions are created through a careful multi-stage curation pipeline and rigorous expert review, and are designed to assess the behavior of Vision-DeepResearch systems under realistic conditions. Moreover, to address the insufficient visual retrieval capabilities of current MLLMs, we propose a simple multi-round cropped-search workflow. This strategy is shown to effectively improve model performance in realistic visual retrieval scenarios. Overall, our results provide practical guidance for the design of future multimodal deep-research systems. The code will be released at https://github.com/Osilly/Vision-DeepResearch.
