ColPali: Efficient Document Retrieval with Vision Language Models
June 27, 2024
Authors: Manuel Faysse, Hugues Sibille, Tony Wu, Gautier Viaud, Céline Hudelot, Pierre Colombo
cs.AI
Abstract
Documents are visually rich structures that convey information through text,
as well as tables, figures, page layouts, or fonts. While modern document
retrieval systems exhibit strong performance on query-to-text matching, they
struggle to exploit visual cues efficiently, hindering their performance on
practical document retrieval applications such as Retrieval Augmented
Generation. To benchmark current systems on visually rich document retrieval,
we introduce the Visual Document Retrieval Benchmark ViDoRe, composed of
various page-level retrieval tasks spanning multiple domains, languages, and
settings. The inherent shortcomings of modern systems motivate the introduction
of a new retrieval model architecture, ColPali, which leverages the document
understanding capabilities of recent Vision Language Models to produce
high-quality contextualized embeddings solely from images of document pages.
Combined with a late interaction matching mechanism, ColPali largely
outperforms modern document retrieval pipelines while being drastically faster
and end-to-end trainable.
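
The late interaction matching mentioned in the abstract can be illustrated with a minimal sketch. It assumes the ColBERT-style "MaxSim" formulation: each query token embedding is matched to its best-scoring page patch embedding by dot product, and those maxima are summed into a page score. The function names (`late_interaction_score`, `rank_pages`) and the toy 2-dimensional vectors are hypothetical and chosen for illustration only; this is not the authors' implementation, and a real system would use batched tensor operations on model-produced embeddings.

```python
from typing import List, Sequence

Vector = Sequence[float]


def dot(u: Vector, v: Vector) -> float:
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def late_interaction_score(query_embs: Sequence[Vector],
                           page_embs: Sequence[Vector]) -> float:
    """MaxSim score: for every query token embedding, take the highest
    dot product against any page patch embedding, then sum the maxima."""
    if not page_embs:
        raise ValueError("page_embs must contain at least one embedding")
    return sum(max(dot(q, d) for d in page_embs) for q in query_embs)


def rank_pages(query_embs: Sequence[Vector],
               pages: Sequence[Sequence[Vector]]) -> List[int]:
    """Return page indices sorted from best to worst match."""
    scores = [late_interaction_score(query_embs, p) for p in pages]
    return sorted(range(len(pages)), key=lambda i: scores[i], reverse=True)


# Toy example with made-up 2-dimensional embeddings.
query = [[1.0, 0.0], [0.0, 1.0]]               # two query token embeddings
page_a = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]  # three patch embeddings
page_b = [[0.5, 0.5]]                          # one patch embedding

ranking = rank_pages(query, [page_b, page_a])  # page_a (index 1) ranks first
```

Because page embeddings do not depend on the query, they can be computed once at indexing time; only the query is embedded at search time, after which scoring reduces to the dot products and maxima shown above.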