mPLUG-DocOwl: Modularized Multimodal Large Language Model for Document Understanding
July 4, 2023
Authors: Jiabo Ye, Anwen Hu, Haiyang Xu, Qinghao Ye, Ming Yan, Yuhao Dan, Chenlin Zhao, Guohai Xu, Chenliang Li, Junfeng Tian, Qian Qi, Ji Zhang, Fei Huang
cs.AI
Abstract
Document understanding refers to automatically extracting, analyzing, and
comprehending information from various types of digital documents, such as web
pages. Existing Multimodal Large Language Models (MLLMs), including mPLUG-Owl,
have demonstrated promising zero-shot capabilities in shallow OCR-free text
recognition, indicating their potential for OCR-free document understanding.
Nevertheless, without in-domain training, these models tend to ignore
fine-grained OCR features, such as sophisticated tables or large blocks of
text, which are essential for OCR-free document understanding. In this paper,
we propose mPLUG-DocOwl, built on mPLUG-Owl, for OCR-free document
understanding. Specifically, we first construct an instruction-tuning dataset
featuring a wide range of visual-text understanding tasks. Then, we strengthen
the OCR-free document understanding ability by jointly training the model on
language-only, general vision-and-language, and document instruction-tuning
datasets with our unified instruction tuning strategy. We also build LLMDoc,
an OCR-free document instruction understanding evaluation set, to better
compare models' capabilities in instruction compliance and document
understanding. Experimental results show that our model outperforms existing
multimodal models, demonstrating its strong document understanding ability.
Moreover, without task-specific fine-tuning, mPLUG-DocOwl generalizes well to
various downstream tasks. Our code, models, training data, and evaluation set
are available at https://github.com/X-PLUG/mPLUG-DocOwl.
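
The abstract's unified instruction tuning strategy jointly trains on three data sources: language-only, general vision-and-language, and document instruction-tuning data. The sketch below illustrates one plausible way such mixing could work, i.e., sampling each training batch from all three pools at fixed ratios; the dataset contents, pool names, and sampling weights are illustrative assumptions, not details taken from the paper.

```python
import random

# Hypothetical instruction-tuning pools; the example contents are
# placeholders, not data from mPLUG-DocOwl's training set.
datasets = {
    "language_only": ["Explain the difference between a stack and a queue."],
    "vision_language": [("photo_001.jpg", "Describe this image.")],
    "document": [("invoice_017.png", "What is the total amount due?")],
}

# Assumed mixing ratios: each batch draws from all three sources so the
# model sees every instruction type throughout training.
weights = {"language_only": 0.3, "vision_language": 0.3, "document": 0.4}

def sample_batch(batch_size: int):
    """Draw a mixed batch of instruction-tuning examples."""
    names = list(datasets)
    probs = [weights[n] for n in names]
    batch = []
    for _ in range(batch_size):
        source = random.choices(names, weights=probs)[0]
        batch.append((source, random.choice(datasets[source])))
    return batch

if __name__ == "__main__":
    for source, example in sample_batch(4):
        print(source, "->", example)
```

Mixing sources per batch, rather than training on each dataset sequentially, is one common way to keep multimodal instruction-following and document-specific skills from overwriting each other; the actual sampling scheme used by the authors may differ.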