InfiMM-HD: A Leap Forward in High-Resolution Multimodal Understanding
March 3, 2024
作者: Haogeng Liu, Quanzeng You, Xiaotian Han, Yiqi Wang, Bohan Zhai, Yongfei Liu, Yunzhe Tao, Huaibo Huang, Ran He, Hongxia Yang
cs.AI
Abstract
Multimodal Large Language Models (MLLMs) have experienced significant
advancements recently. Nevertheless, challenges persist in the accurate
recognition and comprehension of intricate details within high-resolution
images. Despite being indispensable for the development of robust MLLMs, this
area remains underinvestigated. To tackle this challenge, our work introduces
InfiMM-HD, a novel architecture specifically designed for processing images of
different resolutions with low computational overhead. This innovation
facilitates the enlargement of MLLMs to higher-resolution capabilities.
InfiMM-HD incorporates a cross-attention module and visual windows to reduce
computation costs. By integrating this architectural design with a four-stage
training pipeline, our model attains improved visual perception efficiently and
cost-effectively. Empirical studies underscore the robustness and effectiveness
of InfiMM-HD, opening new avenues for exploration in related areas. Code and
models can be found at https://huggingface.co/Infi-MM/infimm-hd.
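The computational benefit of visual windows can be sketched with a rough token-count calculation. This is not the authors' code, and the patch size, window size, and resolutions below are illustrative assumptions only; it merely shows why attending within fixed-size windows scales better than one global attention pass over a high-resolution image.

```python
# Illustrative sketch (not from the paper): self-attention cost is
# quadratic in token count, so partitioning a high-resolution image
# into fixed-size windows bounds the quadratic term per window.

def tokens(side_px: int, patch_px: int = 14) -> int:
    """Number of ViT patch tokens for a square image (assumed 14px patches)."""
    return (side_px // patch_px) ** 2

def attn_cost(num_tokens: int) -> int:
    """Pairwise attention interactions: quadratic in token count."""
    return num_tokens * num_tokens

def windowed_cost(side_px: int, window_px: int = 448) -> int:
    """Attend independently within each window: linear in the number
    of windows, quadratic only within a window."""
    n_windows = (side_px // window_px) ** 2
    return n_windows * attn_cost(tokens(window_px))

full = attn_cost(tokens(1344))   # one global pass over a 1344px image
win = windowed_cost(1344)        # 3x3 grid of 448px windows
print(full // win)               # prints 9: windows are 9x cheaper here
```

With a 3x3 grid, each window holds 1/9 of the tokens, so each window's quadratic cost drops by 81x while the number of windows grows only 9x, a net 9x saving at this resolution; the gap widens as resolution increases.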