VisR-Bench: An Empirical Study on Visual Retrieval-Augmented Generation for Multilingual Long Document Understanding
August 10, 2025
Authors: Jian Chen, Ming Li, Jihyung Kil, Chenguang Wang, Tong Yu, Ryan Rossi, Tianyi Zhou, Changyou Chen, Ruiyi Zhang
cs.AI
Abstract
Most of the world's organizational data is stored as documents, and visual retrieval plays a crucial role in unlocking the collective intelligence contained in these documents. However, existing benchmarks focus on English-only document retrieval or consider multilingual question answering only over single-page images. To bridge this gap, we introduce VisR-Bench, a multilingual benchmark designed for question-driven multimodal retrieval in long documents. Our benchmark comprises over 35K high-quality QA pairs across 1.2K documents, enabling fine-grained evaluation of multimodal retrieval. VisR-Bench spans sixteen languages and three question types (figures, text, and tables), offering diverse linguistic and question coverage. Unlike prior datasets, we include queries without explicit answers, preventing models from relying on superficial keyword matching. We evaluate a range of retrieval models, including text-based methods, multimodal encoders, and multimodal large language models (MLLMs), and provide insights into their strengths and limitations. Our results show that while MLLMs significantly outperform text-based and multimodal encoder models, they still struggle with structured tables and low-resource languages, highlighting key challenges in multilingual visual retrieval.
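
The abstract does not spell out how retrieval is scored. As an illustration only, the following minimal sketch assumes a common setup for question-driven document retrieval: each question is paired with a gold evidence page inside a long document, and retrievers are compared with page-level Recall@k and MRR. The data layout and metric choices here are assumptions, not the paper's actual protocol.

```python
# Minimal sketch of page-level retrieval scoring for question-driven
# document retrieval. The Example layout and the Recall@k / MRR metrics
# are illustrative assumptions; the benchmark's real protocol may differ.
from dataclasses import dataclass


@dataclass
class Example:
    ranked_pages: list[int]  # page indices returned by the retriever, best first
    gold_page: int           # page containing the evidence for the question


def recall_at_k(examples: list[Example], k: int) -> float:
    """Fraction of questions whose gold page appears in the top-k results."""
    hits = sum(ex.gold_page in ex.ranked_pages[:k] for ex in examples)
    return hits / len(examples)


def mean_reciprocal_rank(examples: list[Example]) -> float:
    """Average of 1 / (rank of the gold page); 0 if it is never retrieved."""
    total = 0.0
    for ex in examples:
        if ex.gold_page in ex.ranked_pages:
            total += 1.0 / (ex.ranked_pages.index(ex.gold_page) + 1)
    return total / len(examples)


if __name__ == "__main__":
    # Toy example: two questions over one long document.
    data = [
        Example(ranked_pages=[3, 7, 1], gold_page=7),  # gold page ranked 2nd
        Example(ranked_pages=[5, 2, 9], gold_page=4),  # gold page not retrieved
    ]
    print(f"Recall@1 = {recall_at_k(data, 1):.2f}")        # 0.00
    print(f"Recall@3 = {recall_at_k(data, 3):.2f}")        # 0.50
    print(f"MRR      = {mean_reciprocal_rank(data):.2f}")  # 0.25
```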