InfiMM-HD: A Leap Forward in High-Resolution Multimodal Understanding
March 3, 2024
Authors: Haogeng Liu, Quanzeng You, Xiaotian Han, Yiqi Wang, Bohan Zhai, Yongfei Liu, Yunzhe Tao, Huaibo Huang, Ran He, Hongxia Yang
cs.AI
Abstract
Multimodal Large Language Models (MLLMs) have experienced significant
advancements recently. Nevertheless, challenges persist in the accurate
recognition and comprehension of intricate details within high-resolution
images. Despite being indispensable for the development of robust MLLMs, this
area remains underinvestigated. To tackle this challenge, our work introduces
InfiMM-HD, a novel architecture specifically designed for processing images of
different resolutions with low computational overhead. This innovation
facilitates the enlargement of MLLMs to higher-resolution capabilities.
InfiMM-HD incorporates a cross-attention module and visual windows to reduce
computation costs. By integrating this architectural design with a four-stage
training pipeline, our model attains improved visual perception efficiently and
cost-effectively. Empirical studies underscore the robustness and effectiveness
of InfiMM-HD, opening new avenues for exploration in related areas. Code and
models are available at https://huggingface.co/Infi-MM/infimm-hd.
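The abstract names two cost-saving ingredients, a cross-attention module and visual windows, without specifying their exact form. A minimal sketch of how such a combination could work, assuming (hypothetically) that visual tokens first self-attend within small non-overlapping spatial windows and that text tokens then cross-attend to the resulting visual sequence; all class and parameter names here are illustrative, not the paper's:

```python
import torch
import torch.nn as nn


class WindowedVisualCrossAttention(nn.Module):
    """Illustrative sketch: window-local self-attention over a visual
    feature grid, followed by text-to-visual cross-attention.

    Window attention costs O(num_windows * w^4) instead of O((H*W)^2),
    which is the kind of saving the abstract alludes to for
    high-resolution inputs."""

    def __init__(self, dim: int = 256, num_heads: int = 8, window: int = 4):
        super().__init__()
        self.window = window
        # Self-attention applied independently inside each w x w window.
        self.win_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Cross-attention: text tokens are queries, visual tokens are keys/values.
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, text: torch.Tensor, vis: torch.Tensor) -> torch.Tensor:
        # text: (B, T, C) language-model hidden states
        # vis:  (B, H, W, C) visual feature grid; H and W divisible by window
        B, H, W, C = vis.shape
        w = self.window
        # Partition the grid into non-overlapping w x w windows:
        # (B, H, W, C) -> (B * num_windows, w*w, C)
        win = (vis.reshape(B, H // w, w, W // w, w, C)
                  .permute(0, 1, 3, 2, 4, 5)
                  .reshape(B * (H // w) * (W // w), w * w, C))
        win, _ = self.win_attn(win, win, win)
        # Merge all window tokens back into one sequence per batch element.
        vis_tokens = win.reshape(B, -1, C)  # (B, H*W, C)
        # Text queries attend to the visual tokens.
        out, _ = self.cross_attn(text, vis_tokens, vis_tokens)
        return out


# Usage: 5 text tokens attend to an 8x8 visual grid split into 4x4 windows.
module = WindowedVisualCrossAttention(dim=256, num_heads=8, window=4)
text = torch.randn(2, 5, 256)
vis = torch.randn(2, 8, 8, 256)
fused = module(text, vis)  # (2, 5, 256)
```

The key design point is that the quadratic attention cost is confined to each window, so raising the input resolution grows the number of windows linearly rather than growing one global attention matrix quadratically.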