

mPLUG-DocOwl 1.5: Unified Structure Learning for OCR-free Document Understanding

March 19, 2024
Authors: Anwen Hu, Haiyang Xu, Jiabo Ye, Ming Yan, Liang Zhang, Bo Zhang, Chen Li, Ji Zhang, Qin Jin, Fei Huang, Jingren Zhou
cs.AI

Abstract

Structure information is critical for understanding the semantics of text-rich images, such as documents, tables, and charts. Existing Multimodal Large Language Models (MLLMs) for Visual Document Understanding are equipped with text recognition ability but lack general structure understanding abilities for text-rich document images. In this work, we emphasize the importance of structure information in Visual Document Understanding and propose the Unified Structure Learning to boost the performance of MLLMs. Our Unified Structure Learning comprises structure-aware parsing tasks and multi-grained text localization tasks across 5 domains: document, webpage, table, chart, and natural image. To better encode structure information, we design a simple and effective vision-to-text module H-Reducer, which can not only maintain the layout information but also reduce the length of visual features by merging horizontal adjacent patches through convolution, enabling the LLM to understand high-resolution images more efficiently. Furthermore, by constructing structure-aware text sequences and multi-grained pairs of texts and bounding boxes for publicly available text-rich images, we build a comprehensive training set DocStruct4M to support structure learning. Finally, we construct a small but high-quality reasoning tuning dataset DocReason25K to trigger the detailed explanation ability in the document domain. Our model DocOwl 1.5 achieves state-of-the-art performance on 10 visual document understanding benchmarks, improving the SOTA performance of MLLMs with a 7B LLM by more than 10 points in 5/10 benchmarks. Our codes, models, and datasets are publicly available at https://github.com/X-PLUG/mPLUG-DocOwl/tree/main/DocOwl1.5.
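The abstract describes H-Reducer only at a high level: it merges horizontally adjacent visual patches through a convolution, shrinking the visual sequence fed to the LLM while preserving row-wise layout. The sketch below is a minimal NumPy stand-in for that idea, not the authors' implementation; the function name `h_reduce`, the reduction ratio `k=4`, and the random weight initialization are all illustrative assumptions. A 1×k convolution with stride 1×k over the patch grid is equivalent to grouping k width-wise neighbours and applying one linear projection, which is what the code does.

```python
import numpy as np

def h_reduce(features, k=4, weight=None):
    """Merge k horizontally adjacent patches, shrinking the visual
    sequence length by a factor of k while keeping the row layout.

    features: patch grid of shape (H, W, D)
    weight:   projection of shape (D*k, D); equivalent to a 1xk conv
              with stride 1xk. A random init is used if omitted.
    Returns an array of shape (H, W // k, D).
    """
    H, W, D = features.shape
    assert W % k == 0, "patch grid width must be divisible by k"
    if weight is None:
        rng = np.random.default_rng(0)
        weight = rng.standard_normal((D * k, D)) / np.sqrt(D * k)
    # Group k width-wise neighbours into one vector, then project back to D.
    grouped = features.reshape(H, W // k, k * D)
    return grouped @ weight

# Toy example: a 4x8 grid of 16-dim patch features becomes a 4x2 grid,
# so the LLM sees 8 visual tokens per image instead of 32.
feats = np.ones((4, 8, 16), dtype=np.float32)
reduced = h_reduce(feats, k=4)
print(reduced.shape)  # (4, 2, 16)
```

Merging only along the width (rather than pooling in both directions) is what lets the module keep line-level layout: text in documents runs horizontally, so adjacent patches in a row usually belong to the same text span, while vertical structure between lines is preserved.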

