mPLUG-DocOwl 1.5: Unified Structure Learning for OCR-free Document Understanding

Abstract

Structure information is critical for understanding the semantics of text-rich images such as documents, tables, and charts. Existing Multimodal Large Language Models (MLLMs) for Visual Document Understanding can recognize text but lack general structure understanding for text-rich document images. In this work, we emphasize the importance of structure information in Visual Document Understanding and propose Unified Structure Learning to boost the performance of MLLMs. Unified Structure Learning comprises structure-aware parsing tasks and multi-grained text localization tasks across 5 domains: document, webpage, table, chart, and natural image. To better encode structure information, we design a simple and effective vision-to-text module, H-Reducer, which preserves layout information while shortening the visual feature sequence by merging horizontally adjacent patches through convolution, enabling the LLM to understand high-resolution images more efficiently. Furthermore, by constructing structure-aware text sequences and multi-grained pairs of texts and bounding boxes for publicly available text-rich images, we build a comprehensive training set, DocStruct4M, to support structure learning. Finally, we construct a small but high-quality reasoning tuning dataset, DocReason25K, to trigger detailed explanation ability in the document domain. Our model, DocOwl 1.5, achieves state-of-the-art performance on 10 visual document understanding benchmarks, improving the SOTA performance of MLLMs with a 7B LLM by more than 10 points on 5 of the 10 benchmarks. Our code, models, and datasets are publicly available at https://github.com/X-PLUG/mPLUG-DocOwl/tree/main/DocOwl1.5.
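The horizontal patch merging attributed to H-Reducer can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the function names `merge_horizontal` and `averaging_kernel`, the merge width of 4, and the fixed averaging weights are all assumptions chosen for illustration, whereas a real module would learn its convolution weights. The sketch applies a 1 x k convolution with stride k along each row of a patch grid, so the number of rows (the vertical layout) is unchanged while each row becomes k times shorter.

```python
from typing import List

Vector = List[float]
Grid = List[List[Vector]]          # rows x columns x channels
Kernel = List[List[List[float]]]   # k x in_channels x out_channels


def merge_horizontal(grid: Grid, weights: Kernel) -> Grid:
    """Apply a 1 x k convolution with stride k along every row.

    weights[j][c_in][c_out] is the kernel slice for the j-th patch
    inside each window of k horizontally adjacent patches.
    """
    k = len(weights)
    out_channels = len(weights[0][0])
    merged: Grid = []
    for row in grid:
        if len(row) % k != 0:
            raise ValueError("row width must be divisible by the merge width")
        new_row: List[Vector] = []
        for start in range(0, len(row), k):
            out = [0.0] * out_channels
            for j in range(k):
                for c_in, value in enumerate(row[start + j]):
                    for c_out in range(out_channels):
                        out[c_out] += value * weights[j][c_in][c_out]
            new_row.append(out)
        merged.append(new_row)
    return merged


def averaging_kernel(k: int, channels: int) -> Kernel:
    """Hypothetical fixed kernel: average k neighbours, channel by channel."""
    return [
        [[1.0 / k if c_in == c_out else 0.0 for c_out in range(channels)]
         for c_in in range(channels)]
        for _ in range(k)
    ]


if __name__ == "__main__":
    # A toy 2 x 8 grid of 2-channel patch features (16 visual tokens).
    grid = [[[float(col), 1.0] for col in range(8)] for _ in range(2)]
    merged = merge_horizontal(grid, averaging_kernel(4, 2))
    tokens_before = sum(len(row) for row in grid)
    tokens_after = sum(len(row) for row in merged)
    print(tokens_before, tokens_after)
```

Because merging happens only along the horizontal axis, patches from different rows are never mixed, which is one way to read the abstract's claim that layout information is maintained while the visual sequence is shortened.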
