mPLUG-DocOwl: Modularized Multimodal Large Language Model for Document Understanding

Abstract

Document understanding refers to automatically extracting, analyzing, and comprehending information from various types of digital documents, such as web pages. Existing Multimodal Large Language Models (MLLMs), including mPLUG-Owl, have demonstrated promising zero-shot capabilities in shallow OCR-free text recognition, indicating their potential for OCR-free document understanding. Nevertheless, without in-domain training, these models tend to ignore fine-grained OCR features, such as sophisticated tables or large blocks of text, which are essential for OCR-free document understanding. In this paper, we propose mPLUG-DocOwl, based on mPLUG-Owl, for OCR-free document understanding. Specifically, we first construct an instruction tuning dataset featuring a wide range of visual-text understanding tasks. Then, we strengthen the OCR-free document understanding ability by jointly training the model on language-only, general vision-and-language, and document instruction tuning datasets with our unified instruction tuning strategy. We also build LLMDoc, an OCR-free document instruction understanding evaluation set, to better compare models' capabilities in instruction compliance and document understanding. Experimental results show that our model outperforms existing multimodal models, demonstrating its strong document understanding ability. In addition, without specific fine-tuning, mPLUG-DocOwl generalizes well to various downstream tasks. Our code, models, training data, and evaluation set are available at https://github.com/X-PLUG/mPLUG-DocOwl.
