

mPLUG-DocOwl2: High-resolution Compressing for OCR-free Multi-page Document Understanding

September 5, 2024
Authors: Anwen Hu, Haiyang Xu, Liang Zhang, Jiabo Ye, Ming Yan, Ji Zhang, Qin Jin, Fei Huang, Jingren Zhou
cs.AI

Abstract

Multimodal Large Language Models (MLLMs) have achieved promising OCR-free Document Understanding performance by increasing the supported resolution of document images. However, this comes at the cost of generating thousands of visual tokens for a single document image, leading to excessive GPU memory usage and slower inference, particularly for multi-page document comprehension. To address these challenges, we propose a High-resolution DocCompressor module that compresses each high-resolution document image into 324 tokens, guided by low-resolution global visual features. With this compression module, to strengthen multi-page document comprehension ability and balance both token efficiency and question-answering performance, we develop DocOwl2 under a three-stage training framework: Single-image Pretraining, Multi-image Continue-pretraining, and Multi-task Finetuning. DocOwl2 sets a new state of the art across multi-page document understanding benchmarks and reduces first-token latency by more than 50%, demonstrating advanced capabilities in multi-page question answering, explanation with evidence pages, and cross-page structure understanding. Additionally, compared to single-image MLLMs trained on similar data, our DocOwl2 achieves comparable single-page understanding performance with less than 20% of the visual tokens. Our code, models, and data are publicly available at https://github.com/X-PLUG/mPLUG-DocOwl/tree/main/DocOwl2.
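
The abstract describes the DocCompressor as compressing high-resolution document features into 324 tokens under the guidance of low-resolution global visual features. The sketch below illustrates one plausible form of such guided compression, assuming a single cross-attention layer in which the global tokens act as queries and the high-resolution crop tokens act as keys and values; the class name, dimensions, crop count, and layer structure are illustrative assumptions, not the released implementation.

```python
import torch
import torch.nn as nn


class DocCompressorSketch(nn.Module):
    """Hypothetical sketch of guided visual-token compression.

    Low-resolution global features serve as queries and high-resolution
    crop features serve as keys/values, so each of the 324 output tokens
    summarizes the high-resolution content at its spatial location.
    All sizes and the single-layer design are assumptions for illustration.
    """

    def __init__(self, dim: int = 1024, num_heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.proj = nn.Linear(dim, dim)

    def forward(self, global_feats: torch.Tensor, highres_feats: torch.Tensor) -> torch.Tensor:
        # global_feats:  (B, 324, dim) tokens from the low-resolution global view
        # highres_feats: (B, N, dim)   tokens from all high-resolution crops, N >> 324
        compressed, _ = self.cross_attn(
            query=global_feats, key=highres_feats, value=highres_feats
        )
        return self.proj(compressed)  # (B, 324, dim) tokens handed to the LLM


# Dummy usage: features of 9 high-resolution crops are squeezed into 324 tokens.
x_global = torch.randn(2, 324, 1024)
x_highres = torch.randn(2, 9 * 324, 1024)
print(DocCompressorSketch()(x_global, x_highres).shape)  # torch.Size([2, 324, 1024])
```

Whatever the exact architecture, the key property claimed in the abstract is that the number of tokens fed to the language model stays fixed at 324 per page, independent of how many high-resolution crops a page produces.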