mPLUG-DocOwl: Modularized Multimodal Large Language Model for Document Understanding

Abstract

Document understanding refers to automatically extracting, analyzing, and comprehending information from various types of digital documents, such as web pages. Existing Multimodal Large Language Models (MLLMs), including mPLUG-Owl, have demonstrated promising zero-shot capabilities in shallow OCR-free text recognition, indicating their potential for OCR-free document understanding. Nevertheless, without in-domain training, these models tend to ignore fine-grained OCR features, such as sophisticated tables or large blocks of text, which are essential for OCR-free document understanding. In this paper, we propose mPLUG-DocOwl, built on mPLUG-Owl, for OCR-free document understanding. Specifically, we first construct an instruction tuning dataset featuring a wide range of visual-text understanding tasks. Then, we strengthen the OCR-free document understanding ability by jointly training the model on language-only, general vision-and-language, and document instruction tuning datasets with our unified instruction tuning strategy. We also build LLMDoc, an OCR-free document instruction understanding evaluation set, to better compare models' capabilities in instruction compliance and document understanding. Experimental results show that our model outperforms existing multimodal models, demonstrating its strong document understanding ability. In addition, without specific fine-tuning, mPLUG-DocOwl generalizes well to various downstream tasks. Our code, models, training data, and evaluation set are available at https://github.com/X-PLUG/mPLUG-DocOwl.
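To make the idea of joint training on three data sources more concrete, the following is a minimal sketch of how language-only, general vision-and-language, and document samples might be cast into one shared instruction format and then mixed. It is an illustration under stated assumptions, not the authors' implementation: the `Sample` class, the `to_unified_prompt` and `mix_datasets` helpers, the `<image>` placeholder, and the `Human:`/`AI:` template are all hypothetical names chosen for this example.

```python
import random
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Sample:
    """One instruction-tuning example (hypothetical schema for illustration)."""
    source: str                       # "language", "vision_language", or "document"
    instruction: str
    answer: str
    image_path: Optional[str] = None  # None for language-only samples


def to_unified_prompt(sample: Sample) -> str:
    """Render any sample with one shared template (assumed, not from the paper)."""
    # Samples with an image get an image placeholder in front of the dialogue.
    image_token = "<image>\n" if sample.image_path else ""
    return f"{image_token}Human: {sample.instruction}\nAI: {sample.answer}"


def mix_datasets(datasets: Dict[str, List[Sample]], seed: int = 0) -> List[Sample]:
    """Concatenate all sources and shuffle so each batch can draw from every source."""
    rng = random.Random(seed)
    mixed = [s for samples in datasets.values() for s in samples]
    rng.shuffle(mixed)
    return mixed


if __name__ == "__main__":
    data = {
        "language": [Sample("language", "Say hi.", "Hi.")],
        "document": [Sample("document", "What is the total?", "42", "invoice.png")],
    }
    for sample in mix_datasets(data):
        print(to_unified_prompt(sample))
        print("---")
```

Because every source passes through the same template, a single training loop can consume the mixed list without source-specific branches; the actual template, sampling ratios, and training schedule are not specified in the abstract.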
