ColPali: Efficient Document Retrieval with Vision Language Models

Abstract

Documents are visually rich structures that convey information not only through text but also through tables, figures, page layouts, and fonts. While modern document retrieval systems perform strongly on query-to-text matching, they struggle to exploit visual cues efficiently, which hinders their performance in practical document retrieval applications such as Retrieval Augmented Generation. To benchmark current systems on visually rich document retrieval, we introduce the Visual Document Retrieval Benchmark (ViDoRe), composed of various page-level retrieval tasks spanning multiple domains, languages, and settings. The inherent shortcomings of modern systems motivate the introduction of a new retrieval model architecture, ColPali, which leverages the document understanding capabilities of recent Vision Language Models to produce high-quality contextualized embeddings solely from images of document pages. Combined with a late interaction matching mechanism, ColPali largely outperforms modern document retrieval pipelines while being drastically faster and end-to-end trainable.
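The abstract does not spell out the scoring function behind the late interaction matching mechanism. As a minimal sketch, assuming the ColBERT-style form in which each query vector is matched to its best page vector and the maxima are summed, the scoring could look like the following. The function names (`dot`, `late_interaction_score`, `rank_pages`) and the toy 2-dimensional embeddings are hypothetical, chosen only for illustration, and are not taken from the paper.

```python
from typing import List, Sequence

Vector = Sequence[float]


def dot(u: Vector, v: Vector) -> float:
    """Plain dot product between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same dimension")
    return sum(a * b for a, b in zip(u, v))


def late_interaction_score(
    query_vecs: Sequence[Vector], page_vecs: Sequence[Vector]
) -> float:
    """Sum, over query vectors, of the best dot product with any page vector.

    Assumption: ColBERT-style sum-of-maxima scoring; the abstract itself
    only says "late interaction" without giving a formula.
    """
    if not query_vecs or not page_vecs:
        raise ValueError("both sides need at least one embedding vector")
    return sum(max(dot(q, p) for p in page_vecs) for q in query_vecs)


def rank_pages(
    query_vecs: Sequence[Vector], pages: Sequence[Sequence[Vector]]
) -> List[int]:
    """Return page indices sorted from highest to lowest score."""
    scores = [late_interaction_score(query_vecs, p) for p in pages]
    return sorted(range(len(pages)), key=lambda i: scores[i], reverse=True)


# Toy embeddings, invented for illustration only (not real model output).
query = [[1.0, 0.0], [0.0, 1.0]]
page_a = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
page_b = [[0.5, 0.5], [0.25, 0.0]]

if __name__ == "__main__":
    print(late_interaction_score(query, page_a))
    print(late_interaction_score(query, page_b))
    print(rank_pages(query, [page_b, page_a]))
```

Because page embeddings do not depend on the query in this scheme, they can be computed once at indexing time, leaving only the cheap dot products and maxima to be evaluated per query.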
