mPLUG-DocOwl: Modularized Multimodal Large Language Model for Document Understanding

Abstract

Document understanding refers to automatically extracting, analyzing, and comprehending information from various types of digital documents, such as web pages. Existing Multimodal Large Language Models (MLLMs), including mPLUG-Owl, have demonstrated promising zero-shot capabilities in shallow OCR-free text recognition, indicating their potential for OCR-free document understanding. Nevertheless, without in-domain training, these models tend to ignore fine-grained OCR features, such as sophisticated tables or large blocks of text, which are essential for OCR-free document understanding. In this paper, we propose mPLUG-DocOwl, built on mPLUG-Owl, for OCR-free document understanding. Specifically, we first construct an instruction tuning dataset featuring a wide range of visual-text understanding tasks. Then, we strengthen the OCR-free document understanding ability by jointly training the model on language-only, general vision-and-language, and document instruction tuning datasets with our unified instruction tuning strategy. We also build an OCR-free document instruction understanding evaluation set, LLMDoc, to better compare models' capabilities in instruction compliance and document understanding. Experimental results show that our model outperforms existing multimodal models, demonstrating its strong document understanding ability. Moreover, without specific fine-tuning, mPLUG-DocOwl generalizes well to various downstream tasks. Our code, models, training data, and evaluation set are available at https://github.com/X-PLUG/mPLUG-DocOwl.
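
The joint training on language-only, general vision-and-language, and document instruction data can be pictured with a small sketch. The abstract does not specify the prompt format or the mixing procedure, so everything below is an assumption made for illustration: the Sample class, the TEMPLATE string, the "<image>" placeholder, the file names, and the simple pool-and-shuffle mixing are hypothetical and are not taken from the mPLUG-DocOwl code.

```python
import random
from dataclasses import dataclass
from typing import Dict, List, Optional

# Hypothetical prompt template; the real format is not given in the abstract.
TEMPLATE = "Human: {instruction}\nAI: {response}"


@dataclass
class Sample:
    source: str                       # "language", "vision_language", or "document"
    instruction: str
    response: str
    image_path: Optional[str] = None  # None for language-only samples


def to_unified(sample: Sample) -> dict:
    """Render a sample from any source into one shared text format."""
    text = TEMPLATE.format(instruction=sample.instruction, response=sample.response)
    if sample.image_path is not None:
        # Assumed placeholder marking where visual features would be inserted.
        text = "<image>\n" + text
    return {"source": sample.source, "image": sample.image_path, "text": text}


def mix(datasets: Dict[str, List[Sample]], seed: int = 0) -> List[dict]:
    """Pool all sources and shuffle so training batches can contain every type."""
    pooled = [to_unified(s) for samples in datasets.values() for s in samples]
    random.Random(seed).shuffle(pooled)
    return pooled


# Toy data with made-up contents and file names.
datasets = {
    "language": [Sample("language", "Define OCR.", "Optical character recognition.")],
    "vision_language": [Sample("vision_language", "What is shown?", "A cat.", "cat.jpg")],
    "document": [Sample("document", "What is the total?", "42", "invoice.png")],
}
mixed = mix(datasets)
for record in mixed:
    print(record["source"], "|", record["text"].replace("\n", " / "))
```

The point of the sketch is only that all three data sources end up in one record shape, so a single training loop can consume them together; any weighting between sources would be a further design choice not covered here.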
