Beyond LLaVA-HD: Diving into High-Resolution Large Multimodal Models
June 12, 2024
Authors: Yi-Fan Zhang, Qingsong Wen, Chaoyou Fu, Xue Wang, Zhang Zhang, Liang Wang, Rong Jin
cs.AI
Abstract
Seeing clearly with high resolution is a foundation of Large Multimodal
Models (LMMs), which has been proven to be vital for visual perception and
reasoning. Existing works usually employ a straightforward resolution upscaling
method, where the image consists of global and local branches, with the latter
being the sliced image patches but resized to the same resolution as the
former. This means that higher resolution requires more local patches,
resulting in exorbitant computational expenses, and meanwhile, the dominance of
local image tokens may diminish the global context. In this paper, we dive into
the problems and propose a new framework as well as an elaborate optimization
strategy. Specifically, we extract contextual information from the global view
using a mixture of adapters, based on the observation that different adapters
excel at different tasks. With regard to local patches, learnable query
embeddings are introduced to reduce the number of image tokens; the tokens most
relevant to the user question are then further selected by a similarity-based
selector. Our empirical results demonstrate a `less is more' pattern, where
utilizing fewer but more informative local image tokens leads to
improved performance. Besides, a significant challenge lies in the training
strategy, as simultaneous end-to-end training of the global mining block and
local compression block does not yield optimal results. We thus advocate for an
alternating training way, ensuring balanced learning between global and local
aspects. Finally, we also introduce a challenging dataset with high
requirements for image detail, enhancing the training of the local compression
layer. The proposed method, termed LMM with Sophisticated Tasks, Local image
compression, and Mixture of global Experts (SliME), achieves leading
performance across various benchmarks with only 2 million training samples.
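The similarity-based selector described above can be illustrated with a minimal sketch: pool the user-question embeddings into a single query vector, score each compressed local image token by cosine similarity, and keep only the top-k. The function name, mean-pooling choice, and top-k scheme are assumptions for illustration, not the authors' exact implementation.

```python
import numpy as np

def select_local_tokens(local_tokens, question_emb, k):
    """Keep the k local image tokens most similar (cosine) to the pooled
    user-question embedding. Hypothetical sketch of a similarity-based
    selector; pooling and scoring choices are assumptions."""
    q = question_emb.mean(axis=0)                       # pool question tokens
    q = q / (np.linalg.norm(q) + 1e-8)                  # normalize query
    t = local_tokens / (np.linalg.norm(local_tokens, axis=1, keepdims=True) + 1e-8)
    sims = t @ q                                        # (N,) cosine scores
    keep = np.sort(np.argsort(-sims)[:k])               # top-k, original order
    return local_tokens[keep]

# Example: reduce 16 local tokens to the 4 most question-relevant ones.
rng = np.random.default_rng(0)
tokens = rng.normal(size=(16, 8))       # compressed local image tokens
question = rng.normal(size=(5, 8))      # user-question token embeddings
selected = select_local_tokens(tokens, question, k=4)   # shape (4, 8)
```

Sorting the kept indices preserves the original spatial order of the tokens, which matters when the selected tokens are fed back into a sequence model.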
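The alternating training scheme advocated above can be sketched as a simple schedule that switches the optimized branch every few steps instead of updating both end to end. The step functions and the fixed switching period are assumptions for illustration; the paper's actual schedule may differ.

```python
def alternating_train(steps, train_global_step, train_local_step, period=2):
    """Alternate updates between the global mining block (mixture of adapters)
    and the local compression block. Hypothetical sketch: each callback is
    assumed to update one branch while the other stays frozen."""
    log = []
    for step in range(steps):
        if (step // period) % 2 == 0:
            train_global_step(step)   # update global branch, local frozen
            log.append("global")
        else:
            train_local_step(step)    # update local branch, global frozen
            log.append("local")
    return log
```

With `period=2`, eight steps yield the schedule global, global, local, local, global, global, local, local, so neither branch dominates the optimization.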