ColPali: Efficient Document Retrieval with Vision Language Models

Abstract

Documents are visually rich structures that convey information through text as well as tables, figures, page layouts, and fonts. While modern document retrieval systems perform strongly on query-to-text matching, they struggle to exploit visual cues efficiently, which hinders their performance in practical document retrieval applications such as Retrieval Augmented Generation. To benchmark current systems on visually rich document retrieval, we introduce the Visual Document Retrieval Benchmark ViDoRe, composed of various page-level retrieval tasks spanning multiple domains, languages, and settings. The inherent shortcomings of modern systems motivate the introduction of a new retrieval model architecture, ColPali, which leverages the document understanding capabilities of recent Vision Language Models to produce high-quality contextualized embeddings solely from images of document pages. Combined with a late interaction matching mechanism, ColPali largely outperforms modern document retrieval pipelines while being drastically faster and end-to-end trainable.
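The abstract does not spell out the late interaction matching mechanism. As a minimal sketch, assuming a ColBERT-style "MaxSim" scoring rule (each query vector is matched to its most similar page vector, and the per-vector maxima are summed), page ranking could look like the following. The function names, the toy 2-dimensional embeddings, and the page identifiers are all hypothetical and chosen only for illustration; in a real system the vectors would come from the model's query and page-image encoders.

```python
from typing import Dict, List, Sequence, Tuple

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def late_interaction_score(query_vecs: List[Vector], page_vecs: List[Vector]) -> float:
    """Assumed MaxSim rule: for every query vector take the best-matching
    page vector, then sum those maxima over the whole query."""
    if not page_vecs:
        raise ValueError("a page needs at least one embedding vector")
    return sum(max(dot(q, p) for p in page_vecs) for q in query_vecs)


def rank_pages(
    query_vecs: List[Vector], pages: Dict[str, List[Vector]]
) -> List[Tuple[str, float]]:
    """Score every page against the query and sort by descending score."""
    scores = {
        name: late_interaction_score(query_vecs, vecs) for name, vecs in pages.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Toy data (hypothetical): two query vectors, two pages with multi-vector embeddings.
query = [[1.0, 0.0], [0.0, 1.0]]
pages = {
    "page_a": [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]],
    "page_b": [[0.5, 0.5], [0.6, 0.0]],
}

ranking = rank_pages(query, pages)
for name, score in ranking:
    print(f"{name}: {score:.2f}")
```

Because each page keeps one vector per token or image patch rather than a single pooled vector, page embeddings can be computed offline and only the cheap max-and-sum step runs at query time, which is the usual motivation for this kind of scoring.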
