

mPLUG-DocOwl: Modularized Multimodal Large Language Model for Document Understanding

July 4, 2023
Authors: Jiabo Ye, Anwen Hu, Haiyang Xu, Qinghao Ye, Ming Yan, Yuhao Dan, Chenlin Zhao, Guohai Xu, Chenliang Li, Junfeng Tian, Qian Qi, Ji Zhang, Fei Huang
cs.AI

Abstract

Document understanding refers to automatically extracting, analyzing, and comprehending information from various types of digital documents, such as web pages. Existing Multimodal Large Language Models (MLLMs), including mPLUG-Owl, have demonstrated promising zero-shot capabilities in shallow OCR-free text recognition, indicating their potential for OCR-free document understanding. Nevertheless, without in-domain training, these models tend to ignore fine-grained OCR features, such as sophisticated tables or large blocks of text, which are essential for OCR-free document understanding. In this paper, we propose mPLUG-DocOwl, based on mPLUG-Owl, for OCR-free document understanding. Specifically, we first construct an instruction tuning dataset featuring a wide range of visual-text understanding tasks. Then, we strengthen the OCR-free document understanding ability by jointly training the model on language-only, general vision-and-language, and document instruction tuning datasets with our unified instruction tuning strategy. We also build an OCR-free document instruction understanding evaluation set, LLMDoc, to better compare models' capabilities in instruction compliance and document understanding. Experimental results show that our model outperforms existing multi-modal models, demonstrating its strong document understanding ability. Besides, without specific fine-tuning, mPLUG-DocOwl generalizes well on various downstream tasks. Our code, models, training data, and evaluation set are available at https://github.com/X-PLUG/mPLUG-DocOwl.
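To make the unified instruction tuning strategy more concrete, the following is a minimal sketch of how joint training over the three data sources might be organized: batches are sampled from language-only, general vision-and-language, and document instruction data according to mixing weights. The function `mixed_batches`, the dataset names, and the weights are illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch of unified instruction tuning: interleave samples from
# language-only, general vision-and-language, and document instruction data
# so one model is jointly optimized on all three. Mixing weights are invented
# for illustration; the paper's actual recipe may differ.
import random

def mixed_batches(datasets, weights, batch_size, num_batches):
    """Yield batches drawn from several datasets according to mixing weights."""
    names = list(datasets)
    for _ in range(num_batches):
        batch = []
        for _ in range(batch_size):
            name = random.choices(names, weights=weights, k=1)[0]
            batch.append(random.choice(datasets[name]))
        yield batch

# Toy stand-ins for the three instruction tuning sources.
datasets = {
    "language_only": [{"instruction": "Summarize this passage.", "image": None}],
    "vision_language": [{"instruction": "Describe the image.", "image": "img0"}],
    "document": [{"instruction": "What is the total in the table?", "image": "doc0"}],
}

for batch in mixed_batches(datasets, weights=[1, 1, 2], batch_size=4, num_batches=2):
    # In real training, each batch would be fed to the MLLM with a standard
    # language-modeling loss on the response tokens.
    print([example["instruction"] for example in batch])
```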