LMDX: Language Model-based Document Information Extraction and Localization

Abstract

Large Language Models (LLMs) have revolutionized Natural Language Processing (NLP), improving the state of the art on many existing tasks and exhibiting emergent capabilities. However, LLMs have not yet been successfully applied to semi-structured document information extraction, which is at the core of many document processing workflows and consists of extracting key entities from a visually rich document (VRD) given a predefined target schema. The main obstacles to LLM adoption in that task have been the absence of layout encoding within LLMs, which is critical for high-quality extraction, and the lack of a grounding mechanism ensuring the answer is not hallucinated. In this paper, we introduce Language Model-based Document Information Extraction and Localization (LMDX), a methodology to adapt arbitrary LLMs for document information extraction. LMDX can extract singular, repeated, and hierarchical entities, both with and without training data, while providing grounding guarantees and localizing the entities within the document. In particular, we apply LMDX to the PaLM 2-S LLM and evaluate it on the VRDU and CORD benchmarks, setting a new state of the art and showing how LMDX enables the creation of high-quality, data-efficient parsers.


