VisR-Bench: An Empirical Study on Visual Retrieval-Augmented Generation for Multilingual Long Document Understanding
August 10, 2025
Authors: Jian Chen, Ming Li, Jihyung Kil, Chenguang Wang, Tong Yu, Ryan Rossi, Tianyi Zhou, Changyou Chen, Ruiyi Zhang
cs.AI
Abstract
Most of the world's organizational data is stored as documents, and visual retrieval plays a crucial role in unlocking the collective intelligence contained in these documents. However, existing benchmarks focus on English-only document retrieval or consider multilingual question answering only over single-page images. To bridge this gap, we introduce VisR-Bench, a multilingual benchmark designed for question-driven multimodal retrieval in long documents. Our benchmark comprises over 35K high-quality QA pairs across 1.2K documents, enabling fine-grained evaluation of multimodal retrieval. VisR-Bench spans sixteen languages and three question types (figures, text, and tables), offering diverse linguistic and question coverage. Unlike prior datasets, we include queries without explicit answers, preventing models from relying on superficial keyword matching. We evaluate a range of retrieval models, including text-based methods, multimodal encoders, and multimodal large language models (MLLMs), and provide insights into their strengths and limitations. Our results show that while MLLMs significantly outperform text-based and multimodal encoder models, they still struggle with structured tables and low-resource languages, highlighting key challenges in multilingual visual retrieval.
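
The abstract does not spell out how retrieval is scored. As an illustration only, the following minimal sketch assumes a common setup for question-driven document retrieval: each question is paired with a gold evidence page inside a long document, and retrievers are compared with page-level Recall@k and MRR. The data layout and metric choices here are assumptions, not the paper's actual protocol.

```python
# Minimal sketch of page-level retrieval scoring for question-driven
# document retrieval. The Example layout and the Recall@k / MRR metrics
# are illustrative assumptions; the benchmark's real protocol may differ.
from dataclasses import dataclass


@dataclass
class Example:
    ranked_pages: list[int]  # page indices returned by the retriever, best first
    gold_page: int           # page containing the evidence for the question


def recall_at_k(examples: list[Example], k: int) -> float:
    """Fraction of questions whose gold page appears in the top-k results."""
    hits = sum(ex.gold_page in ex.ranked_pages[:k] for ex in examples)
    return hits / len(examples)


def mean_reciprocal_rank(examples: list[Example]) -> float:
    """Average of 1 / (rank of the gold page); 0 if it is never retrieved."""
    total = 0.0
    for ex in examples:
        if ex.gold_page in ex.ranked_pages:
            total += 1.0 / (ex.ranked_pages.index(ex.gold_page) + 1)
    return total / len(examples)


if __name__ == "__main__":
    # Toy example: two questions over one long document.
    data = [
        Example(ranked_pages=[3, 7, 1], gold_page=7),  # gold page ranked 2nd
        Example(ranked_pages=[5, 2, 9], gold_page=4),  # gold page not retrieved
    ]
    print(f"Recall@1 = {recall_at_k(data, 1):.2f}")        # 0.00
    print(f"Recall@3 = {recall_at_k(data, 3):.2f}")        # 0.50
    print(f"MRR      = {mean_reciprocal_rank(data):.2f}")  # 0.25
```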