TextHawk: Exploring Efficient Fine-Grained Perception of Multimodal Large Language Models

April 14, 2024
Authors: Ya-Qi Yu, Minghui Liao, Jihao Wu, Yongxin Liao, Xiaoyu Zheng, Wei Zeng
cs.AI

Abstract

Multimodal Large Language Models (MLLMs) have shown impressive results on various multimodal tasks. However, most existing MLLMs are not well suited for document-oriented tasks, which require fine-grained image perception and information compression. In this paper, we present TextHawk, an MLLM that is specifically designed for document-oriented tasks while preserving the general capabilities of MLLMs. TextHawk aims to explore efficient fine-grained perception through four dedicated components. First, a ReSampling and ReArrangement (ReSA) module is proposed to reduce redundancy in document texts and lower the computational cost of the MLLM. We encode the position of each local feature with Scalable Positional Embeddings (SPEs), which preserve scalability across various image sizes. A Query Proposal Network (QPN) is then adopted to initialize the queries dynamically across different sub-images. To further enhance the fine-grained visual perception ability of the MLLM, we design a Multi-Level Cross-Attention (MLCA) mechanism that captures the hierarchical structure and semantic relations of document images. Furthermore, we create a new instruction-tuning dataset for document-oriented tasks by enriching multimodal document data with Gemini Pro. We conduct extensive experiments on both general and document-oriented MLLM benchmarks, and show that TextHawk outperforms state-of-the-art methods, demonstrating its effectiveness and superiority in fine-grained document perception and general abilities.
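
The abstract names the ReSA, SPE, QPN, and MLCA components without giving implementation details. As a rough illustration of the general idea of compressing multi-level visual features into a small set of tokens via cross-attention, here is a minimal, hypothetical PyTorch sketch; the class name, layer choices, and hyperparameters are assumptions for illustration only and do not reflect the authors' actual design.

```python
# Illustrative sketch only: a cross-attention resampler over multi-level visual
# features, loosely in the spirit of the ReSA/MLCA components described in the
# abstract. All names, shapes, and hyperparameters here are assumptions.
import torch
import torch.nn as nn


class MultiLevelResampler(nn.Module):
    """Compress multi-level visual features into a fixed number of query tokens
    via cross-attention (hypothetical stand-in, not the paper's implementation)."""

    def __init__(self, dim=1024, num_queries=64, num_levels=3, num_heads=8):
        super().__init__()
        # Learnable base queries; a Query Proposal Network could instead
        # propose these per sub-image (not shown here).
        self.queries = nn.Parameter(torch.randn(num_queries, dim) * 0.02)
        # One cross-attention layer per feature level (coarse to fine).
        self.cross_attn = nn.ModuleList(
            nn.MultiheadAttention(dim, num_heads, batch_first=True)
            for _ in range(num_levels)
        )
        self.norm = nn.LayerNorm(dim)

    def forward(self, feature_levels):
        """feature_levels: list of (B, N_i, dim) visual features, one per level."""
        B = feature_levels[0].size(0)
        q = self.queries.unsqueeze(0).expand(B, -1, -1)
        # Attend to each level in turn, refining the same compact query set.
        for attn, feats in zip(self.cross_attn, feature_levels):
            out, _ = attn(query=q, key=feats, value=feats)
            q = self.norm(q + out)
        return q  # (B, num_queries, dim): compressed tokens fed to the LLM


if __name__ == "__main__":
    # Toy multi-scale features for one image (e.g., three ViT feature levels).
    levels = [torch.randn(1, n, 1024) for n in (256, 1024, 4096)]
    tokens = MultiLevelResampler()(levels)
    print(tokens.shape)  # torch.Size([1, 64, 1024])
```

The point of such a design is that the language model only ever sees a short, fixed-length token sequence per (sub-)image, regardless of how many visual tokens the encoder produces, which is where the computational savings for dense document images would come from.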
