LMDX: Language Model-based Document Information Extraction and Localization

Abstract

Large Language Models (LLMs) have revolutionized Natural Language Processing (NLP), improving the state of the art on many existing tasks and exhibiting emergent capabilities. However, LLMs have not yet been successfully applied to semi-structured document information extraction, a task at the core of many document processing workflows that consists of extracting key entities from a visually rich document (VRD) given a predefined target schema. The main obstacles to LLM adoption for this task have been the absence of layout encoding within LLMs, which is critical for high-quality extraction, and the lack of a grounding mechanism ensuring that the answer is not hallucinated. In this paper, we introduce Language Model-based Document Information Extraction and Localization (LMDX), a methodology for adapting arbitrary LLMs to document information extraction. LMDX can extract singular, repeated, and hierarchical entities, both with and without training data, while providing grounding guarantees and localizing the entities within the document. In particular, we apply LMDX to the PaLM 2-S LLM and evaluate it on the VRDU and CORD benchmarks, setting a new state of the art and showing how LMDX enables the creation of high-quality, data-efficient parsers.
