TextHawk: Exploring Efficient Fine-Grained Perception of Multimodal Large Language Models

Abstract

Multimodal Large Language Models (MLLMs) have shown impressive results on various multimodal tasks. However, most existing MLLMs are not well suited to document-oriented tasks, which require fine-grained image perception and information compression. In this paper, we present TextHawk, an MLLM designed specifically for document-oriented tasks that preserves the general capabilities of MLLMs. TextHawk explores efficient fine-grained perception through four dedicated components. First, a ReSampling and ReArrangement (ReSA) module reduces the redundancy in document texts and lowers the computational cost of the MLLM. Second, Scalable Positional Embeddings (SPEs) encode the position of each local feature while remaining scalable across various image sizes. Third, a Query Proposal Network (QPN) initializes the queries dynamically among different sub-images. Fourth, to further enhance the fine-grained visual perception of the MLLM, a Multi-Level Cross-Attention (MLCA) mechanism captures the hierarchical structure and semantic relations of document images. Furthermore, we create a new instruction-tuning dataset for document-oriented tasks by enriching the multimodal document data with Gemini Pro. We conduct extensive experiments on both general and document-oriented MLLM benchmarks and show that TextHawk outperforms state-of-the-art methods, demonstrating its effectiveness and superiority in fine-grained document perception and general abilities.
