mPLUG-DocOwl 1.5: Unified Structure Learning for OCR-free Document Understanding
March 19, 2024
Authors: Anwen Hu, Haiyang Xu, Jiabo Ye, Ming Yan, Liang Zhang, Bo Zhang, Chen Li, Ji Zhang, Qin Jin, Fei Huang, Jingren Zhou
cs.AI
Abstract
Structure information is critical for understanding the semantics of
text-rich images, such as documents, tables, and charts. Existing Multimodal
Large Language Models (MLLMs) for Visual Document Understanding are equipped
with text recognition ability but lack general structure understanding
abilities for text-rich document images. In this work, we emphasize the
importance of structure information in Visual Document Understanding and
propose the Unified Structure Learning to boost the performance of MLLMs. Our
Unified Structure Learning comprises structure-aware parsing tasks and
multi-grained text localization tasks across 5 domains: document, webpage,
table, chart, and natural image. To better encode structure information, we
design a simple and effective vision-to-text module H-Reducer, which can not
only maintain the layout information but also reduce the length of visual
features by merging horizontal adjacent patches through convolution, enabling
the LLM to understand high-resolution images more efficiently. Furthermore, by
constructing structure-aware text sequences and multi-grained pairs of texts
and bounding boxes for publicly available text-rich images, we build a
comprehensive training set DocStruct4M to support structure learning. Finally,
we construct a small but high-quality reasoning tuning dataset DocReason25K to
trigger the detailed explanation ability in the document domain. Our model
DocOwl 1.5 achieves state-of-the-art performance on 10 visual document
understanding benchmarks, improving the SOTA performance of MLLMs with a 7B LLM
by more than 10 points on 5 of the 10 benchmarks. Our code, models, and
datasets are publicly available at
https://github.com/X-PLUG/mPLUG-DocOwl/tree/main/DocOwl1.5.
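To make the H-Reducer idea concrete, the following is a minimal PyTorch sketch of a vision-to-text module that merges horizontally adjacent patch features with a strided convolution, shortening the visual token sequence while keeping row layout intact. The module name HReducerSketch, the 1x4 merge ratio, and the feature dimensions are illustrative assumptions for exposition, not the released implementation:

```python
import torch
import torch.nn as nn

class HReducerSketch(nn.Module):
    """Sketch of the H-Reducer idea: merge horizontally adjacent patches.

    The 1x4 merge ratio and 1024-d features are assumptions, not the
    authors' exact configuration.
    """

    def __init__(self, dim: int = 1024, merge: int = 4):
        super().__init__()
        # A 1 x merge convolution with matching stride fuses `merge`
        # horizontally adjacent patches into one feature, leaving the
        # vertical axis (and hence row layout) untouched.
        self.conv = nn.Conv2d(dim, dim, kernel_size=(1, merge), stride=(1, merge))

    def forward(self, patch_grid: torch.Tensor) -> torch.Tensor:
        # patch_grid: (batch, dim, height, width) grid of ViT patch features.
        reduced = self.conv(patch_grid)  # (batch, dim, height, width // merge)
        # Flatten row by row into a token sequence for the LLM, preserving
        # left-to-right, top-to-bottom reading order.
        return reduced.flatten(2).transpose(1, 2)  # (batch, tokens, dim)

# Example: a 32x32 patch grid (1024 visual tokens) shrinks to 256 tokens.
feats = torch.randn(1, 1024, 32, 32)
print(HReducerSketch()(feats).shape)  # torch.Size([1, 256, 1024])
```

Merging along the width only is what preserves layout: each output token still corresponds to a contiguous left-to-right span within a single row of the page, which is why a high-resolution image can be handed to the LLM with far fewer tokens without scrambling reading order.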