LMDX: Language Model-based Document Information Extraction and Localization

Abstract

Large Language Models (LLMs) have revolutionized Natural Language Processing (NLP), improving the state of the art on many existing tasks and exhibiting emergent capabilities. However, LLMs have not yet been successfully applied to semi-structured document information extraction, which is at the core of many document processing workflows and consists of extracting key entities from a visually rich document (VRD) given a predefined target schema. The main obstacles to LLM adoption in that task have been the absence of layout encoding within LLMs, which is critical for high-quality extraction, and the lack of a grounding mechanism ensuring that the answer is not hallucinated. In this paper, we introduce Language Model-based Document Information Extraction and Localization (LMDX), a methodology to adapt arbitrary LLMs for document information extraction. LMDX can extract singular, repeated, and hierarchical entities, both with and without training data, while providing grounding guarantees and localizing the entities within the document. In particular, we apply LMDX to the PaLM 2-S LLM and evaluate it on the VRDU and CORD benchmarks, setting a new state of the art and showing how LMDX enables the creation of high-quality, data-efficient parsers.
