LMDX: Language Model-based Document Information Extraction and Localization

Abstract

Large Language Models (LLMs) have revolutionized Natural Language Processing (NLP), improving the state of the art on many existing tasks and exhibiting emergent capabilities. However, LLMs have not yet been successfully applied to semi-structured document information extraction, which is at the core of many document processing workflows and consists of extracting key entities from a visually rich document (VRD) given a predefined target schema. The main obstacles to LLM adoption in that task have been the absence of layout encoding within LLMs, which is critical for high-quality extraction, and the lack of a grounding mechanism ensuring that the answer is not hallucinated. In this paper, we introduce Language Model-based Document Information Extraction and Localization (LMDX), a methodology for adapting arbitrary LLMs to document information extraction. LMDX can extract singular, repeated, and hierarchical entities, both with and without training data, while providing grounding guarantees and localizing the entities within the document. In particular, we apply LMDX to the PaLM 2-S LLM and evaluate it on the VRDU and CORD benchmarks, setting a new state of the art and showing how LMDX enables the creation of high-quality, data-efficient parsers.
