Beyond LLaVA-HD: Diving into High-Resolution Large Multimodal Models
June 12, 2024
Authors: Yi-Fan Zhang, Qingsong Wen, Chaoyou Fu, Xue Wang, Zhang Zhang, Liang Wang, Rong Jin
cs.AI
Abstract
Seeing clearly with high resolution is a foundation of Large Multimodal
Models (LMMs), which has been proven to be vital for visual perception and
reasoning. Existing works usually employ a straightforward resolution upscaling
method, where the image consists of global and local branches, with the latter
being the sliced image patches but resized to the same resolution as the
former. This means that higher resolution requires more local patches,
resulting in exorbitant computational expenses, and meanwhile, the dominance of
local image tokens may diminish the global context. In this paper, we dive into
the problems and propose a new framework as well as an elaborate optimization
strategy. Specifically, we extract contextual information from the global view
using a mixture of adapters, based on the observation that different adapters
excel at different tasks. With regard to local patches, learnable query
embeddings are introduced to reduce the number of image tokens; the tokens most
relevant to the user question are then further selected by a similarity-based
selector. Our empirical results demonstrate a `less is more' pattern, where
utilizing fewer but more informative local image tokens leads to
improved performance. Besides, a significant challenge lies in the training
strategy, as simultaneous end-to-end training of the global mining block and
local compression block does not yield optimal results. We thus advocate for an
alternating training way, ensuring balanced learning between global and local
aspects. Finally, we also introduce a challenging dataset with high
requirements for image detail, enhancing the training of the local compression
layer. The proposed method, termed LMM with Sophisticated Tasks, Local image
compression, and Mixture of global Experts (SliME), achieves leading
performance across various benchmarks with only 2 million training samples.
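The similarity-based selector described above can be illustrated with a minimal sketch: pool the user-question embeddings into a single query vector, score each compressed local image token by cosine similarity, and keep only the top-k. The function name, mean-pooling choice, and top-k scheme are assumptions for illustration, not the authors' exact implementation.

```python
import numpy as np

def select_local_tokens(local_tokens, question_emb, k):
    """Keep the k local image tokens most similar (cosine) to the pooled
    user-question embedding. Hypothetical sketch of a similarity-based
    selector; pooling and scoring choices are assumptions."""
    q = question_emb.mean(axis=0)                       # pool question tokens
    q = q / (np.linalg.norm(q) + 1e-8)                  # normalize query
    t = local_tokens / (np.linalg.norm(local_tokens, axis=1, keepdims=True) + 1e-8)
    sims = t @ q                                        # (N,) cosine scores
    keep = np.sort(np.argsort(-sims)[:k])               # top-k, original order
    return local_tokens[keep]

# Example: reduce 16 local tokens to the 4 most question-relevant ones.
rng = np.random.default_rng(0)
tokens = rng.normal(size=(16, 8))       # compressed local image tokens
question = rng.normal(size=(5, 8))      # user-question token embeddings
selected = select_local_tokens(tokens, question, k=4)   # shape (4, 8)
```

Sorting the kept indices preserves the original spatial order of the tokens, which matters when the selected tokens are fed back into a sequence model.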
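The alternating training scheme advocated above can be sketched as a simple schedule that switches the optimized branch every few steps instead of updating both end to end. The step functions and the fixed switching period are assumptions for illustration; the paper's actual schedule may differ.

```python
def alternating_train(steps, train_global_step, train_local_step, period=2):
    """Alternate updates between the global mining block (mixture of adapters)
    and the local compression block. Hypothetical sketch: each callback is
    assumed to update one branch while the other stays frozen."""
    log = []
    for step in range(steps):
        if (step // period) % 2 == 0:
            train_global_step(step)   # update global branch, local frozen
            log.append("global")
        else:
            train_local_step(step)    # update local branch, global frozen
            log.append("local")
    return log
```

With `period=2`, eight steps yield the schedule global, global, local, local, global, global, local, local, so neither branch dominates the optimization.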