DocGraphLM: Documental Graph Language Model for Information Extraction
January 5, 2024
Authors: Dongsheng Wang, Zhiqiang Ma, Armineh Nourbakhsh, Kang Gu, Sameena Shah
cs.AI
Abstract
Advances in Visually Rich Document Understanding (VrDU) have enabled information extraction and question answering over documents with complex layouts. Two families of architectures have emerged: transformer-based models inspired by LLMs, and Graph Neural Networks. In this paper, we introduce DocGraphLM, a novel framework that combines pre-trained language models with graph semantics. To achieve this, we propose 1) a joint encoder architecture to represent documents, and 2) a novel link prediction approach to reconstruct document graphs. DocGraphLM predicts both the directions and distances between nodes using a convergent joint loss function that prioritizes neighborhood restoration and down-weights distant node detection. Our experiments on three SotA datasets show consistent improvements on IE and QA tasks with the adoption of graph features. Moreover, we report that the graph features, although constructed solely through link prediction, accelerate convergence during training.
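The abstract does not give the exact form of the joint loss, so the following is only a minimal sketch of one plausible reading: a cross-entropy term over discretized direction classes combined with a distance-regression term whose weight decays with the true inter-node distance, so that neighborhood restoration dominates and errors on far-apart pairs are down-weighted. The function name, the inverse-distance weighting `1 / (1 + d)`, and the `alpha` balance factor are all illustrative assumptions, not details from the paper.

```python
import math

def direction_distance_loss(dir_logits, dir_labels, dist_pred, dist_true, alpha=1.0):
    """Hypothetical joint loss over node pairs.

    dir_logits: per-pair logits over direction classes (list of lists)
    dir_labels: per-pair true direction class index
    dist_pred / dist_true: predicted and true inter-node distances
    alpha: assumed balance factor between the two terms
    """
    total = 0.0
    for logits, label, dp, dt in zip(dir_logits, dir_labels, dist_pred, dist_true):
        # numerically stable softmax cross-entropy for the direction head
        m = max(logits)
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        ce = log_z - logits[label]
        # inverse-distance weight: near-neighbor pairs matter more, so
        # distance errors on remote pairs contribute less to the loss
        w = 1.0 / (1.0 + dt)
        total += ce + alpha * w * (dp - dt) ** 2
    return total / len(dir_labels)
```

Under this weighting, the same absolute distance error costs more on a nearby pair than on a remote one, which matches the abstract's emphasis on prioritizing neighborhood restoration.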