UNIDOC-BENCH: A Unified Benchmark for Document-Centric Multimodal RAG

Abstract

Multimodal retrieval-augmented generation (MM-RAG) is a key approach for applying large language models (LLMs) and agents to real-world knowledge bases, yet current evaluations are fragmented: they focus either on text or images in isolation or on simplified multimodal setups that fail to capture document-centric multimodal use cases. In this paper, we introduce UniDoc-Bench, the first large-scale, realistic benchmark for MM-RAG, built from 70k real-world PDF pages across eight domains. Our pipeline extracts and links evidence from text, tables, and figures, then generates 1,600 multimodal QA pairs spanning factual retrieval, comparison, summarization, and logical reasoning queries. To ensure reliability, 20% of the QA pairs are validated by multiple annotators and expert adjudication. UniDoc-Bench supports apples-to-apples comparison across four paradigms, (1) text-only, (2) image-only, (3) multimodal text-image fusion, and (4) multimodal joint retrieval, under a unified protocol with standardized candidate pools, prompts, and evaluation metrics. Our experiments show that multimodal text-image fusion RAG systems consistently outperform both unimodal and joint multimodal embedding-based retrieval, indicating that neither text nor images alone are sufficient and that current multimodal embeddings remain inadequate. Beyond benchmarking, our analysis reveals when and how visual context complements textual evidence, uncovers systematic failure modes, and offers actionable guidance for developing more robust MM-RAG pipelines.
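
As a rough illustration of how the four retrieval paradigms named above differ, the following minimal sketch ranks a toy candidate pool under each one. Everything in it is an assumption made for illustration: the two-dimensional embeddings, the index contents, the function names, and in particular the fusion rule (retrieve the top-k from a text index and the top-k from an image index separately, then concatenate) are hypothetical and are not taken from the UniDoc-Bench implementation.

```python
from math import sqrt


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query_vec, index, k):
    """Return the ids of the k candidates most similar to query_vec."""
    ranked = sorted(index, key=lambda cid: cosine(query_vec, index[cid]),
                    reverse=True)
    return ranked[:k]


def retrieve(paradigm, query, text_index, image_index, joint_index, k=1):
    """Retrieve candidate ids under one of the four paradigms.

    `query` maps an encoder name ("text", "image", "joint") to the query's
    embedding under that (hypothetical) encoder.
    """
    if paradigm == "text_only":
        return top_k(query["text"], text_index, k)
    if paradigm == "image_only":
        return top_k(query["image"], image_index, k)
    if paradigm == "fusion":
        # Assumed fusion rule: search each modality separately, then merge.
        return (top_k(query["text"], text_index, k)
                + top_k(query["image"], image_index, k))
    if paradigm == "joint":
        # One shared embedding space holds both text and image candidates.
        return top_k(query["joint"], joint_index, k)
    raise ValueError(f"unknown paradigm: {paradigm}")


# Toy candidate pool (made-up vectors): t* are text chunks, i* are page images.
text_index = {"t1": (1.0, 0.0), "t2": (0.0, 1.0)}
image_index = {"i1": (0.9, 0.1), "i2": (0.1, 0.9)}
joint_index = {"t1": (1.0, 0.0), "t2": (0.0, 1.0),
               "i1": (0.2, 0.8), "i2": (0.8, 0.2)}
query = {"text": (1.0, 0.0), "image": (1.0, 0.0), "joint": (1.0, 0.0)}

if __name__ == "__main__":
    for name in ("text_only", "image_only", "fusion", "joint"):
        print(name, retrieve(name, query, text_index, image_index, joint_index))
```

In this toy setup, fusion returns evidence from both modalities for the same k, whereas joint retrieval makes text and image candidates compete in a single ranking. That is only a structural difference between the paradigms, not a reproduction of the reported results.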
