TextHawk: Exploring Efficient Fine-Grained Perception of Multimodal Large Language Models

April 14, 2024
Authors: Ya-Qi Yu, Minghui Liao, Jihao Wu, Yongxin Liao, Xiaoyu Zheng, Wei Zeng
cs.AI

Abstract

Multimodal Large Language Models (MLLMs) have shown impressive results on various multimodal tasks. However, most existing MLLMs are not well suited for document-oriented tasks, which require fine-grained image perception and information compression. In this paper, we present TextHawk, an MLLM that is specifically designed for document-oriented tasks while preserving the general capabilities of MLLMs. TextHawk aims to explore efficient fine-grained perception through four dedicated components. Firstly, a ReSampling and ReArrangement (ReSA) module is proposed to reduce the redundancy in document texts and lower the computational cost of the MLLM. We encode the position of each local feature with Scalable Positional Embeddings (SPEs), which preserve scalability across various image sizes. A Query Proposal Network (QPN) is then adopted to initialize the queries dynamically among different sub-images. To further enhance the fine-grained visual perceptual ability of the MLLM, we design a Multi-Level Cross-Attention (MLCA) mechanism that captures the hierarchical structure and semantic relations of document images. Furthermore, we create a new instruction-tuning dataset for document-oriented tasks by enriching the multimodal document data with Gemini Pro. We conduct extensive experiments on both general and document-oriented MLLM benchmarks, and show that TextHawk outperforms state-of-the-art methods, demonstrating its effectiveness and superiority in fine-grained document perception and general abilities.
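The abstract only names the Multi-Level Cross-Attention (MLCA) mechanism without detailing it. As a rough illustration of the general idea — queries attending jointly over visual features drawn from multiple encoder levels — here is a minimal pure-Python sketch. The function names, the flat concatenation of level features, and the use of features as both keys and values are illustrative assumptions, not the paper's actual design.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def cross_attention(queries, keys, values):
    """Scaled dot-product cross-attention: each query attends over all keys."""
    d = len(keys[0])  # feature dimension, used for score scaling
    out = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        # Output is the attention-weighted sum of the value vectors.
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(len(values[0]))])
    return out

def multi_level_cross_attention(queries, level_features):
    # Illustrative assumption: pool features from all encoder levels into one
    # token sequence, so each query sees coarse and fine features jointly.
    pooled = [f for level in level_features for f in level]
    return cross_attention(queries, pooled, pooled)
```

In this sketch, `level_features` would hold feature vectors from, say, a coarse and a fine level of the vision encoder; a real implementation would operate on batched tensors with learned projections for queries, keys, and values.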