DocLLM: A layout-aware generative language model for multimodal document understanding

Abstract

Enterprise documents such as forms, invoices, receipts, reports, contracts, and other similar records often carry rich semantics at the intersection of textual and spatial modalities. The visual cues offered by their complex layouts play a crucial role in comprehending these documents effectively. In this paper, we present DocLLM, a lightweight extension to traditional large language models (LLMs) for reasoning over visual documents, taking into account both textual semantics and spatial layout. Our model differs from existing multimodal LLMs by avoiding expensive image encoders and focusing exclusively on bounding box information to incorporate the spatial layout structure. Specifically, the cross-alignment between text and spatial modalities is captured by decomposing the attention mechanism in classical transformers into a set of disentangled matrices. Furthermore, we devise a pre-training objective that learns to infill text segments. This approach allows us to address irregular layouts and heterogeneous content frequently encountered in visual documents. The pre-trained model is fine-tuned using a large-scale instruction dataset covering four core document intelligence tasks. We demonstrate that our solution outperforms state-of-the-art (SotA) LLMs on 14 out of 16 datasets across all tasks, and generalizes well to 4 out of 5 previously unseen datasets.
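
The abstract does not spell out how the attention mechanism is decomposed, so the following is only a minimal sketch under stated assumptions. It assumes that each token carries two separate projections, one from its text embedding and one from its bounding box embedding, and that the attention score is the sum of four dot products (text-text, text-box, box-text, box-box). The three cross terms are weighted by scalars `lam_ts`, `lam_st`, and `lam_ss`, and a causal mask is applied as in a decoder-only LLM. The function names, the scalar weights, and the scaling by the square root of the dimension are all hypothetical choices made for illustration, not details confirmed by the text above.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def scores(queries, keys):
    """Matrix of dot products: result[i][j] = queries[i] . keys[j]."""
    return [[dot(q, k) for k in keys] for q in queries]


def disentangled_attention(q_text, k_text, q_box, k_box,
                           lam_ts=1.0, lam_st=1.0, lam_ss=1.0):
    """Causal attention weights from four disentangled score matrices.

    q_text / k_text: query and key vectors derived from token text (assumed).
    q_box / k_box:   query and key vectors derived from bounding boxes (assumed).
    lam_*:           hypothetical scalars weighting the cross-modal terms.
    """
    n = len(q_text)
    d = len(q_text[0])
    tt = scores(q_text, k_text)  # text  -> text
    ts = scores(q_text, k_box)   # text  -> layout
    st = scores(q_box, k_text)   # layout -> text
    ss = scores(q_box, k_box)    # layout -> layout

    weights = []
    for i in range(n):
        # Causal mask: position i attends only to positions j <= i.
        row = [
            (tt[i][j] + lam_ts * ts[i][j] + lam_st * st[i][j]
             + lam_ss * ss[i][j]) / math.sqrt(d)
            for j in range(i + 1)
        ]
        # Numerically stable softmax over the visible positions.
        m = max(row)
        exps = [math.exp(x - m) for x in row]
        total = sum(exps)
        weights.append([e / total for e in exps] + [0.0] * (n - i - 1))
    return weights


if __name__ == "__main__":
    # Three toy tokens with 2-dimensional text and box projections.
    q_t = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    k_t = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
    q_b = [[0.1, 0.2], [0.3, 0.1], [0.2, 0.2]]
    k_b = [[0.2, 0.1], [0.1, 0.3], [0.4, 0.4]]
    for row in disentangled_attention(q_t, k_t, q_b, k_b):
        print([round(w, 3) for w in row])
```

Setting all three scalars to zero reduces the sketch to ordinary text-only causal attention, which is one way to see why such an extension can be described as lightweight: no image encoder is involved, only extra projections of the bounding box coordinates.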
