DocLLM: A layout-aware generative language model for multimodal document understanding

Abstract

Enterprise documents such as forms, invoices, receipts, reports, contracts, and other similar records often carry rich semantics at the intersection of textual and spatial modalities. The visual cues offered by their complex layouts play a crucial role in comprehending these documents effectively. In this paper, we present DocLLM, a lightweight extension to traditional large language models (LLMs) for reasoning over visual documents, taking into account both textual semantics and spatial layout. Our model differs from existing multimodal LLMs by avoiding expensive image encoders and focusing exclusively on bounding box information to incorporate the spatial layout structure. Specifically, the cross-alignment between text and spatial modalities is captured by decomposing the attention mechanism in classical transformers into a set of disentangled matrices. Furthermore, we devise a pre-training objective that learns to infill text segments. This approach allows us to address the irregular layouts and heterogeneous content frequently encountered in visual documents. The pre-trained model is fine-tuned using a large-scale instruction dataset covering four core document intelligence tasks. We demonstrate that our solution outperforms SotA LLMs on 14 out of 16 datasets across all tasks, and generalizes well to 4 out of 5 previously unseen datasets.
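
To make the idea of decomposing attention into disentangled matrices concrete, the following is a minimal sketch of one plausible reading, not the authors' implementation. It assumes that each token has separate query and key projections for its text embedding and for its bounding-box (spatial) embedding, and that the attention score is a weighted sum of the four text/spatial cross terms. The function names, the weights lam_ts, lam_st and lam_ss, and the toy inputs are all hypothetical.

```python
import math

def matmul_t(a, b):
    """Return a @ b^T for matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(ra, rb)) for rb in b] for ra in a]

def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def disentangled_scores(q_t, k_t, q_s, k_s, lam_ts=1.0, lam_st=1.0, lam_ss=1.0):
    """Sum the four text/spatial score matrices (assumed form, see lead-in).

    q_t, k_t: text queries and keys; q_s, k_s: spatial (bounding-box)
    queries and keys. The lam_* weights are hypothetical mixing factors.
    """
    tt = matmul_t(q_t, k_t)  # text-to-text
    ts = matmul_t(q_t, k_s)  # text-to-spatial
    st = matmul_t(q_s, k_t)  # spatial-to-text
    ss = matmul_t(q_s, k_s)  # spatial-to-spatial
    n = len(tt)
    return [
        [tt[i][j] + lam_ts * ts[i][j] + lam_st * st[i][j] + lam_ss * ss[i][j]
         for j in range(n)]
        for i in range(n)
    ]

def attention_weights(scores, dim):
    """Scale by sqrt(dim) and normalise each row with softmax."""
    return [softmax([v / math.sqrt(dim) for v in row]) for row in scores]

# Toy example: two tokens, projection dimension 2 (made-up values).
q_t = [[1.0, 0.0], [0.0, 1.0]]
k_t = [[1.0, 0.0], [0.0, 1.0]]
q_s = [[1.0, 0.0], [1.0, 0.0]]
k_s = [[0.0, 1.0], [1.0, 0.0]]

scores = disentangled_scores(q_t, k_t, q_s, k_s)
weights = attention_weights(scores, dim=2)
```

Setting all three lam_* weights to zero reduces the scores to ordinary text-only attention, which is one way to see why such an extension can be described as lightweight: the spatial terms are added to the existing mechanism, and no image encoder is involved.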
