DocGraphLM: Documental Graph Language Model for Information Extraction

Abstract

Advances in Visually Rich Document Understanding (VrDU) have enabled information extraction (IE) and question answering (QA) over documents with complex layouts. Two families of architectures have emerged: transformer-based models inspired by LLMs, and Graph Neural Networks. In this paper, we introduce DocGraphLM, a novel framework that combines pre-trained language models with graph semantics. To achieve this, we propose 1) a joint encoder architecture to represent documents, and 2) a novel link prediction approach to reconstruct document graphs. DocGraphLM predicts both the direction and the distance between nodes using a convergent joint loss function that prioritizes neighborhood restoration and downweights the detection of distant nodes. Our experiments on three SotA datasets show consistent improvement on IE and QA tasks with the adoption of graph features. Moreover, we report that adopting the graph features accelerates convergence during training, even though they are constructed solely through link prediction.
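To make the idea of a joint loss that favors nearby nodes concrete, the following is a minimal sketch and not the implementation used in this work. It assumes that each candidate edge carries a predicted and a true distance normalized to [0, 1], a set of direction logits, and a true direction class. The function name `joint_link_loss`, the mixing parameter `lam`, and the weight `1 - true_dist` are illustrative assumptions that do not come from the text above.

```python
import math


def joint_link_loss(pred_dist, true_dist, dir_logits, true_dir, lam=0.5):
    """Illustrative joint loss over candidate edges (hypothetical helper).

    pred_dist, true_dist: distances per edge, assumed normalized to [0, 1].
    dir_logits: per-edge list of unnormalized scores, one per direction class.
    true_dir: per-edge index of the correct direction class.
    lam: assumed mixing weight between the distance and direction terms.
    """
    n = len(true_dist)
    if n == 0:
        return 0.0
    total = 0.0
    for pd, td, logits, d in zip(pred_dist, true_dist, dir_logits, true_dir):
        # Distance term: squared error between predicted and true distance.
        dist_term = (pd - td) ** 2
        # Direction term: softmax cross-entropy, computed stably via log-sum-exp.
        m = max(logits)
        log_z = m + math.log(sum(math.exp(x - m) for x in logits))
        dir_term = log_z - logits[d]
        # Assumed weighting: nearby pairs (small td) count more than distant ones.
        weight = 1.0 - td
        total += weight * (lam * dist_term + (1.0 - lam) * dir_term)
    # Average over the candidate edges.
    return total / n
```

Under these assumptions, an edge between distant nodes contributes less to the loss than a nearby edge with the same prediction error, which is one simple way to prioritize neighborhood restoration.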
