DocGraphLM: Documental Graph Language Model for Information Extraction

Abstract

Advances in Visually Rich Document Understanding (VrDU) have enabled information extraction and question answering over documents with complex layouts. Two families of architectures have emerged: transformer-based models inspired by LLMs, and Graph Neural Networks. In this paper, we introduce DocGraphLM, a novel framework that combines pre-trained language models with graph semantics. To achieve this, we propose 1) a joint encoder architecture to represent documents, and 2) a novel link prediction approach to reconstruct document graphs. DocGraphLM predicts both the direction and the distance between nodes, using a convergent joint loss function that prioritizes neighborhood restoration and downweights distant node detection. Our experiments on three SotA datasets show consistent improvement on information extraction (IE) and question answering (QA) tasks when graph features are adopted. Moreover, we report that adopting the graph features accelerates convergence during training, even though the features are constructed solely through link prediction.
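
The joint loss mentioned in the abstract can be illustrated with a minimal sketch. Everything below is an assumption made for illustration, not the formulation used in the paper: the function name `joint_link_loss`, the mixing weight `lam`, the use of squared error for distance and softmax cross-entropy for direction, and the `1 - distance` weighting that makes nearby node pairs count more than distant ones.

```python
import math

def joint_link_loss(pred_dist, true_dist, dir_logits, true_dir, lam=0.5):
    """Toy joint loss over candidate edges (hypothetical, for illustration).

    pred_dist, true_dist: per-edge distances, assumed normalized to [0, 1].
    dir_logits: per-edge list of logits over direction bins.
    true_dir: per-edge index of the correct direction bin.
    lam: assumed mixing weight between the distance and direction terms.
    """
    total = 0.0
    for p, t, logits, d in zip(pred_dist, true_dist, dir_logits, true_dir):
        # Distance term: squared error on the normalized distance.
        dist_term = (p - t) ** 2
        # Direction term: numerically stable softmax cross-entropy.
        m = max(logits)
        log_z = m + math.log(sum(math.exp(x - m) for x in logits))
        dir_term = log_z - logits[d]
        # Assumed weighting: near neighbors (small t) get a larger weight,
        # so distant node pairs contribute less to the loss.
        weight = 1.0 - t
        total += weight * (lam * dist_term + (1.0 - lam) * dir_term)
    return total / len(pred_dist)

# Same prediction error on a near edge and on a far edge.
near = joint_link_loss([0.3], [0.1], [[0.0, 0.0]], [0])
far = joint_link_loss([0.9], [0.7], [[0.0, 0.0]], [0])
```

With identical errors, the near edge yields the larger loss under this weighting, which is one simple way to prioritize neighborhood restoration over distant node detection.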

