

DocGraphLM: Documental Graph Language Model for Information Extraction

January 5, 2024
Authors: Dongsheng Wang, Zhiqiang Ma, Armineh Nourbakhsh, Kang Gu, Sameena Shah
cs.AI

Abstract

Advances in Visually Rich Document Understanding (VrDU) have enabled information extraction and question answering over documents with complex layouts. Two architectural paradigms have emerged: transformer-based models inspired by LLMs, and Graph Neural Networks. In this paper, we introduce DocGraphLM, a novel framework that combines pre-trained language models with graph semantics. To achieve this, we propose 1) a joint encoder architecture to represent documents, and 2) a novel link prediction approach to reconstruct document graphs. DocGraphLM predicts both the directions and the distances between nodes using a convergent joint loss function that prioritizes neighborhood restoration and down-weights distant node detection. Our experiments on three state-of-the-art datasets show consistent improvements on IE and QA tasks with the adoption of graph features. Moreover, we report that adopting graph features accelerates convergence during training, despite the graph being constructed solely through link prediction.
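The link-prediction objective described above can be illustrated with a small sketch. The code below is a hypothetical, simplified rendering (not the paper's exact formulation): it bins the direction between two node centers into 8 compass sectors, and combines a cross-entropy loss on the direction head with a distance regression term down-weighted by an inverse-distance factor, so that restoring near-neighborhood links dominates the loss. The function names, the 8-way binning, and the `1/(1+d)` weighting are all illustrative assumptions.

```python
import numpy as np

def direction_and_distance(a, b):
    """Ground-truth targets for a node pair: an 8-way direction bin and the
    Euclidean distance between node centers a, b = (x, y).
    Illustrative geometry, not the paper's exact scheme."""
    dx, dy = b[0] - a[0], b[1] - a[1]
    dist = float(np.hypot(dx, dy))
    angle = np.arctan2(dy, dx) % (2 * np.pi)
    direction = int(angle // (np.pi / 4)) % 8  # 8 compass sectors
    return direction, dist

def joint_link_loss(dir_logits, dir_true, dist_pred, dist_true, lam=1.0):
    """Hypothetical joint loss: softmax cross-entropy over direction classes
    plus a distance regression term weighted by 1/(1+d), which down-weights
    distant pairs and prioritizes neighborhood restoration."""
    # numerically stable softmax cross-entropy for the direction head
    z = dir_logits - dir_logits.max()
    log_probs = z - np.log(np.exp(z).sum())
    ce = -log_probs[dir_true]
    # inverse-distance weight: far-apart nodes contribute less
    w = 1.0 / (1.0 + dist_true)
    reg = w * (dist_pred - dist_true) ** 2
    return float(ce + lam * reg)
```

For example, two nodes at `(0, 0)` and `(1, 0)` fall in direction bin 0 at distance 1, and a confident, correct prediction yields a small positive loss dominated by the direction term.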