mPLUG-DocOwl: Modularized Multimodal Large Language Model for Document Understanding
July 4, 2023
Authors: Jiabo Ye, Anwen Hu, Haiyang Xu, Qinghao Ye, Ming Yan, Yuhao Dan, Chenlin Zhao, Guohai Xu, Chenliang Li, Junfeng Tian, Qian Qi, Ji Zhang, Fei Huang
cs.AI
Abstract
Document understanding refers to automatically extracting, analyzing, and
comprehending information from various types of digital documents, such as web
pages. Existing Multimodal Large Language Models (MLLMs), including mPLUG-Owl,
have demonstrated promising zero-shot capabilities in shallow OCR-free text
recognition, indicating their potential for OCR-free document understanding.
Nevertheless, without in-domain training, these models tend to ignore
fine-grained OCR features, such as sophisticated tables or large blocks of
text, which are essential for OCR-free document understanding. In this paper,
we propose mPLUG-DocOwl, built on mPLUG-Owl, for OCR-free document
understanding. Specifically, we first construct an instruction-tuning dataset
featuring a wide range of visual-text understanding tasks. Then, we strengthen
the OCR-free document understanding ability by jointly training the model on
language-only, general vision-and-language, and document instruction-tuning
datasets with our unified instruction tuning strategy. We also build LLMDoc,
an OCR-free document instruction understanding evaluation set, to better
compare models' capabilities in instruction compliance and document
understanding. Experimental results show that our model outperforms existing
multimodal models, demonstrating its strong document understanding ability.
Moreover, without task-specific fine-tuning, mPLUG-DocOwl generalizes well to
various downstream tasks. Our code, models, training data, and evaluation set
are available at https://github.com/X-PLUG/mPLUG-DocOwl.
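
The abstract's unified instruction tuning strategy jointly trains on three data sources: language-only, general vision-and-language, and document instruction-tuning data. The sketch below illustrates one plausible way such mixing could work, i.e., sampling each training batch from all three pools at fixed ratios; the dataset contents, pool names, and sampling weights are illustrative assumptions, not details taken from the paper.

```python
import random

# Hypothetical instruction-tuning pools; the example contents are
# placeholders, not data from mPLUG-DocOwl's training set.
datasets = {
    "language_only": ["Explain the difference between a stack and a queue."],
    "vision_language": [("photo_001.jpg", "Describe this image.")],
    "document": [("invoice_017.png", "What is the total amount due?")],
}

# Assumed mixing ratios: each batch draws from all three sources so the
# model sees every instruction type throughout training.
weights = {"language_only": 0.3, "vision_language": 0.3, "document": 0.4}

def sample_batch(batch_size: int):
    """Draw a mixed batch of instruction-tuning examples."""
    names = list(datasets)
    probs = [weights[n] for n in names]
    batch = []
    for _ in range(batch_size):
        source = random.choices(names, weights=probs)[0]
        batch.append((source, random.choice(datasets[source])))
    return batch

if __name__ == "__main__":
    for source, example in sample_batch(4):
        print(source, "->", example)
```

Mixing sources per batch, rather than training on each dataset sequentially, is one common way to keep multimodal instruction-following and document-specific skills from overwriting each other; the actual sampling scheme used by the authors may differ.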