

mPLUG-DocOwl: Modularized Multimodal Large Language Model for Document Understanding

July 4, 2023
Authors: Jiabo Ye, Anwen Hu, Haiyang Xu, Qinghao Ye, Ming Yan, Yuhao Dan, Chenlin Zhao, Guohai Xu, Chenliang Li, Junfeng Tian, Qian Qi, Ji Zhang, Fei Huang
cs.AI

Abstract

Document understanding refers to automatically extracting, analyzing, and comprehending information from various types of digital documents, such as web pages. Existing Multimodal Large Language Models (MLLMs), including mPLUG-Owl, have demonstrated promising zero-shot capabilities in shallow OCR-free text recognition, indicating their potential for OCR-free document understanding. Nevertheless, without in-domain training, these models tend to ignore fine-grained OCR features, such as sophisticated tables or large blocks of text, which are essential for OCR-free document understanding. In this paper, we propose mPLUG-DocOwl, based on mPLUG-Owl, for OCR-free document understanding. Specifically, we first construct an instruction tuning dataset featuring a wide range of visual-text understanding tasks. Then, we strengthen the OCR-free document understanding ability by jointly training the model on language-only, general vision-and-language, and document instruction tuning datasets with our unified instruction tuning strategy. We also build an OCR-free document instruction understanding evaluation set, LLMDoc, to better compare models' capabilities in instruction compliance and document understanding. Experimental results show that our model outperforms existing multi-modal models, demonstrating its strong document understanding ability. Besides, without specific fine-tuning, mPLUG-DocOwl generalizes well on various downstream tasks. Our code, models, training data, and evaluation set are available at https://github.com/X-PLUG/mPLUG-DocOwl.
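To make the unified instruction tuning strategy more concrete, the following is a minimal sketch of how joint training over the three data sources might be organized: batches are sampled from language-only, general vision-and-language, and document instruction data according to mixing weights. The function `mixed_batches`, the dataset names, and the weights are illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch of unified instruction tuning: interleave samples from
# language-only, general vision-and-language, and document instruction data
# so one model is jointly optimized on all three. Mixing weights are invented
# for illustration; the paper's actual recipe may differ.
import random

def mixed_batches(datasets, weights, batch_size, num_batches):
    """Yield batches drawn from several datasets according to mixing weights."""
    names = list(datasets)
    for _ in range(num_batches):
        batch = []
        for _ in range(batch_size):
            name = random.choices(names, weights=weights, k=1)[0]
            batch.append(random.choice(datasets[name]))
        yield batch

# Toy stand-ins for the three instruction tuning sources.
datasets = {
    "language_only": [{"instruction": "Summarize this passage.", "image": None}],
    "vision_language": [{"instruction": "Describe the image.", "image": "img0"}],
    "document": [{"instruction": "What is the total in the table?", "image": "doc0"}],
}

for batch in mixed_batches(datasets, weights=[1, 1, 2], batch_size=4, num_batches=2):
    # In real training, each batch would be fed to the MLLM with a standard
    # language-modeling loss on the response tokens.
    print([example["instruction"] for example in batch])
```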