

DocLLM: A layout-aware generative language model for multimodal document understanding

December 31, 2023
作者: Dongsheng Wang, Natraj Raman, Mathieu Sibue, Zhiqiang Ma, Petr Babkin, Simerjot Kaur, Yulong Pei, Armineh Nourbakhsh, Xiaomo Liu
cs.AI

Abstract

Enterprise documents such as forms, invoices, receipts, reports, contracts, and other similar records, often carry rich semantics at the intersection of textual and spatial modalities. The visual cues offered by their complex layouts play a crucial role in comprehending these documents effectively. In this paper, we present DocLLM, a lightweight extension to traditional large language models (LLMs) for reasoning over visual documents, taking into account both textual semantics and spatial layout. Our model differs from existing multimodal LLMs by avoiding expensive image encoders and focuses exclusively on bounding box information to incorporate the spatial layout structure. Specifically, the cross-alignment between text and spatial modalities is captured by decomposing the attention mechanism in classical transformers to a set of disentangled matrices. Furthermore, we devise a pre-training objective that learns to infill text segments. This approach allows us to address irregular layouts and heterogeneous content frequently encountered in visual documents. The pre-trained model is fine-tuned using a large-scale instruction dataset, covering four core document intelligence tasks. We demonstrate that our solution outperforms SotA LLMs on 14 out of 16 datasets across all tasks, and generalizes well to 4 out of 5 previously unseen datasets.
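To make the disentangled attention idea concrete, below is a minimal PyTorch sketch of how the attention score could be split into text-text, text-spatial, spatial-text, and spatial-spatial terms computed from separate text and bounding-box embeddings. The module name, the scalar mixing weights `lambda_*`, the single-head formulation, and the choice to draw values from the text stream only are illustrative assumptions; the abstract does not specify these details.

```python
# Hypothetical sketch of disentangled text/spatial attention (single head,
# no causal masking). Projections, scaling, and mixing weights are assumptions,
# not the paper's exact specification.
import torch
import torch.nn as nn


class DisentangledSpatialAttention(nn.Module):
    def __init__(self, d_model: int, lambda_ts: float = 1.0,
                 lambda_st: float = 1.0, lambda_ss: float = 1.0):
        super().__init__()
        # Separate projections for the text modality ...
        self.q_text = nn.Linear(d_model, d_model)
        self.k_text = nn.Linear(d_model, d_model)
        self.v_text = nn.Linear(d_model, d_model)
        # ... and for the spatial (bounding-box) modality.
        self.q_box = nn.Linear(d_model, d_model)
        self.k_box = nn.Linear(d_model, d_model)
        # Scalar weights controlling the contribution of each cross/self term.
        self.lambda_ts = lambda_ts
        self.lambda_st = lambda_st
        self.lambda_ss = lambda_ss
        self.scale = d_model ** -0.5

    def forward(self, text_emb: torch.Tensor, box_emb: torch.Tensor) -> torch.Tensor:
        # text_emb, box_emb: (batch, seq_len, d_model);
        # box_emb is assumed to encode OCR bounding-box coordinates.
        qt, kt, vt = self.q_text(text_emb), self.k_text(text_emb), self.v_text(text_emb)
        qs, ks = self.q_box(box_emb), self.k_box(box_emb)
        # Attention scores decomposed into text-text, text-box, box-text,
        # and box-box interaction terms.
        scores = (qt @ kt.transpose(-2, -1)
                  + self.lambda_ts * (qt @ ks.transpose(-2, -1))
                  + self.lambda_st * (qs @ kt.transpose(-2, -1))
                  + self.lambda_ss * (qs @ ks.transpose(-2, -1))) * self.scale
        attn = scores.softmax(dim=-1)
        # In this sketch, values are taken from the text stream only.
        return attn @ vt
```

In a full autoregressive model one would additionally apply a causal mask to the decomposed scores and use multiple heads; this sketch only illustrates how bounding-box embeddings can enter the attention computation without an image encoder.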