mPLUG-DocOwl2: High-resolution Compressing for OCR-free Multi-page Document Understanding
September 5, 2024
Authors: Anwen Hu, Haiyang Xu, Liang Zhang, Jiabo Ye, Ming Yan, Ji Zhang, Qin Jin, Fei Huang, Jingren Zhou
cs.AI
Abstract
Multimodal Large Language Models (MLLMs) have achieved promising OCR-free
Document Understanding performance by increasing the supported resolution of
document images. However, this comes at the cost of generating thousands of
visual tokens for a single document image, leading to excessive GPU memory and
slower inference times, particularly in multi-page document comprehension. In
this work, to address these challenges, we propose a High-resolution
DocCompressor module to compress each high-resolution document image into 324
tokens, guided by low-resolution global visual features. With this compression
module, to strengthen multi-page document comprehension ability and balance
both token efficiency and question-answering performance, we develop
DocOwl2 under a three-stage training framework: Single-image Pretraining,
Multi-image Continue-pretraining, and Multi-task Finetuning. DocOwl2 sets a new
state-of-the-art across multi-page document understanding benchmarks and
reduces first token latency by more than 50%, demonstrating advanced
capabilities in multi-page question answering, explanation with evidence
pages, and cross-page structure understanding. Additionally, compared to
single-image MLLMs trained on similar data, our DocOwl2 achieves comparable
single-page understanding performance with less than 20% of the visual tokens.
Our code, models, and data are publicly available at
https://github.com/X-PLUG/mPLUG-DocOwl/tree/main/DocOwl2.
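The core idea in the abstract, compressing thousands of high-resolution visual tokens into a fixed budget of 324 tokens guided by low-resolution global features, can be illustrated with cross-attention: the global features act as queries and the high-resolution tokens as keys/values, so the output length equals the query count regardless of input resolution. The following is a minimal sketch of that mechanism only; the function names, feature dimension, and single-head attention are illustrative assumptions, not the paper's actual architecture.

```python
import numpy as np

def cross_attention(q, k, v):
    # Scaled dot-product attention: each query forms a weighted
    # combination of the value vectors.
    scores = q @ k.T / np.sqrt(q.shape[-1])          # (Nq, Nk)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ v                               # (Nq, D)

def compress_page(global_feats, highres_feats):
    # global_feats:  (324, D) low-resolution global features (queries)
    # highres_feats: (N, D) high-resolution tokens, N in the thousands
    # Output always has 324 rows, independent of N.
    return cross_attention(global_feats, highres_feats, highres_feats)

rng = np.random.default_rng(0)
out = compress_page(rng.normal(size=(324, 64)),
                    rng.normal(size=(2916, 64)))
# out.shape == (324, 64): a fixed per-page token budget
```

Because the compressed length is tied to the number of queries rather than the image resolution, per-page cost stays constant as resolution grows, which is what enables the reported memory and first-token-latency savings on multi-page inputs.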