

mPLUG-DocOwl: Modularized Multimodal Large Language Model for Document Understanding

July 4, 2023
Authors: Jiabo Ye, Anwen Hu, Haiyang Xu, Qinghao Ye, Ming Yan, Yuhao Dan, Chenlin Zhao, Guohai Xu, Chenliang Li, Junfeng Tian, Qian Qi, Ji Zhang, Fei Huang
cs.AI

Abstract

Document understanding refers to the automatic extraction, analysis, and comprehension of information from various types of digital documents, such as web pages. Existing Multimodal Large Language Models (MLLMs), including mPLUG-Owl, have demonstrated promising zero-shot capabilities in shallow, OCR-free text recognition, indicating their potential for OCR-free document understanding. Nevertheless, without in-domain training, these models tend to ignore fine-grained OCR features, such as sophisticated tables or large blocks of text, which are essential for OCR-free document understanding. In this paper, we propose mPLUG-DocOwl, built on mPLUG-Owl, for OCR-free document understanding. Specifically, we first construct an instruction-tuning dataset featuring a wide range of visual-text understanding tasks. Then, we strengthen the OCR-free document understanding ability by jointly training the model on language-only, general vision-and-language, and document instruction-tuning datasets with our unified instruction-tuning strategy. We also build an OCR-free document instruction understanding evaluation set, LLMDoc, to better compare models' capabilities in instruction compliance and document understanding. Experimental results show that our model outperforms existing multi-modal models, demonstrating its strong document understanding ability. Besides, without task-specific fine-tuning, mPLUG-DocOwl generalizes well to various downstream tasks. Our code, models, training data, and evaluation set are available at https://github.com/X-PLUG/mPLUG-DocOwl.
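To make the unified instruction-tuning strategy concrete, the sketch below illustrates one way to interleave the three data sources (language-only, general vision-and-language, and document instruction data) under a single instruction-following template. The dataset handles, sampling ratios, and prompt template are illustrative assumptions, not the authors' released implementation; the paper's repository is the authoritative reference.

```python
# A minimal sketch of the joint instruction-tuning data mixture described in
# the abstract. Sampling ratios and the prompt template are assumptions.
import random
from typing import Iterator

def unified_prompt(instruction: str, has_image: bool) -> str:
    # Hypothetical unified template: every sample becomes an
    # instruction-following turn, with an image placeholder when present.
    prefix = "<image>\n" if has_image else ""
    return f"{prefix}Human: {instruction}\nAI:"

def mixed_samples(language_only: list, vision_language: list, document: list,
                  ratios=(0.2, 0.3, 0.5)) -> Iterator[dict]:
    """Interleave the three instruction-tuning sources at fixed ratios."""
    sources = [language_only, vision_language, document]
    while True:
        # Pick a source according to the mixture weights, then a sample from it.
        src = random.choices(sources, weights=ratios, k=1)[0]
        sample = random.choice(src)
        yield {
            "prompt": unified_prompt(sample["instruction"], "image" in sample),
            "target": sample["answer"],
            "image": sample.get("image"),  # None for language-only data
        }
```

Under this reading, the key design choice is that all three sources share one prompt format, so the model is optimized on a single instruction-following objective rather than separate per-task heads.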