

DocLLM: A layout-aware generative language model for multimodal document understanding

December 31, 2023
Authors: Dongsheng Wang, Natraj Raman, Mathieu Sibue, Zhiqiang Ma, Petr Babkin, Simerjot Kaur, Yulong Pei, Armineh Nourbakhsh, Xiaomo Liu
cs.AI

Abstract

Enterprise documents such as forms, invoices, receipts, reports, contracts, and other similar records often carry rich semantics at the intersection of textual and spatial modalities. The visual cues offered by their complex layouts play a crucial role in comprehending these documents effectively. In this paper, we present DocLLM, a lightweight extension to traditional large language models (LLMs) for reasoning over visual documents, taking into account both textual semantics and spatial layout. Our model differs from existing multimodal LLMs by avoiding expensive image encoders and focusing exclusively on bounding box information to incorporate the spatial layout structure. Specifically, the cross-alignment between text and spatial modalities is captured by decomposing the attention mechanism in classical transformers into a set of disentangled matrices. Furthermore, we devise a pre-training objective that learns to infill text segments. This approach allows us to address irregular layouts and heterogeneous content frequently encountered in visual documents. The pre-trained model is fine-tuned using a large-scale instruction dataset, covering four core document intelligence tasks. We demonstrate that our solution outperforms SotA LLMs on 14 out of 16 datasets across all tasks, and generalizes well to 4 out of 5 previously unseen datasets.
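To make the decomposition described in the abstract more concrete, the following PyTorch snippet is a minimal sketch of how attention scores over separate text and bounding-box (spatial) projections might be combined into a set of disentangled terms. The class name, the scalar lambda weights, and the single-head formulation are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of disentangled text/spatial attention (illustrative only).
import torch
import torch.nn as nn
import torch.nn.functional as F


class DisentangledTextSpatialAttention(nn.Module):
    """Single-head attention whose scores mix text and bounding-box projections."""

    def __init__(self, d_model: int):
        super().__init__()
        # Separate projections for the text hidden states ...
        self.q_text = nn.Linear(d_model, d_model)
        self.k_text = nn.Linear(d_model, d_model)
        self.v_text = nn.Linear(d_model, d_model)
        # ... and for the spatial (bounding-box) embeddings.
        self.q_spatial = nn.Linear(d_model, d_model)
        self.k_spatial = nn.Linear(d_model, d_model)
        # Learnable weights for the cross-modal score terms (assumed scalars here).
        self.lambda_ts = nn.Parameter(torch.tensor(1.0))
        self.lambda_st = nn.Parameter(torch.tensor(1.0))
        self.lambda_ss = nn.Parameter(torch.tensor(1.0))

    def forward(self, text_hidden: torch.Tensor, spatial_emb: torch.Tensor) -> torch.Tensor:
        # text_hidden, spatial_emb: (batch, seq_len, d_model)
        qt, kt, vt = self.q_text(text_hidden), self.k_text(text_hidden), self.v_text(text_hidden)
        qs, ks = self.q_spatial(spatial_emb), self.k_spatial(spatial_emb)
        scale = qt.size(-1) ** 0.5
        # Attention score as a sum of text-text, text-spatial, spatial-text,
        # and spatial-spatial interaction terms, instead of a single fused score.
        scores = (
            qt @ kt.transpose(-2, -1)
            + self.lambda_ts * (qt @ ks.transpose(-2, -1))
            + self.lambda_st * (qs @ kt.transpose(-2, -1))
            + self.lambda_ss * (qs @ ks.transpose(-2, -1))
        ) / scale
        return F.softmax(scores, dim=-1) @ vt
```

The point of the sketch is that the spatial signal enters only through lightweight bounding-box projections added to the attention score, rather than through a separate image encoder.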