ChatPaper.ai


Beyond LLaVA-HD: Diving into High-Resolution Large Multimodal Models

June 12, 2024
作者: Yi-Fan Zhang, Qingsong Wen, Chaoyou Fu, Xue Wang, Zhang Zhang, Liang Wang, Rong Jin
cs.AI

Abstract

Seeing clearly with high resolution is a foundation of Large Multimodal Models (LMMs), which has been proven to be vital for visual perception and reasoning. Existing works usually employ a straightforward resolution-upscaling method, where the image consists of global and local branches, the latter being the sliced image patches resized to the same resolution as the former. This means that higher resolution requires more local patches, resulting in exorbitant computational expenses, while the dominance of local image tokens may diminish the global context. In this paper, we dive into these problems and propose a new framework along with an elaborate optimization strategy. Specifically, we extract contextual information from the global view using a mixture of adapters, based on the observation that different adapters excel at different tasks. With regard to local patches, learnable query embeddings are introduced to reduce the number of image tokens; the tokens most relevant to the user question are then further selected by a similarity-based selector. Our empirical results demonstrate a 'less is more' pattern, where utilizing fewer but more informative local image tokens leads to improved performance. Besides, a significant challenge lies in the training strategy, as simultaneous end-to-end training of the global mining block and the local compression block does not yield optimal results. We thus advocate an alternating training scheme, ensuring balanced learning between global and local aspects. Finally, we also introduce a challenging dataset with high requirements for image detail, enhancing the training of the local compression layer. The proposed method, termed LMM with Sophisticated Tasks, Local image compression, and Mixture of global Experts (SliME), achieves leading performance across various benchmarks with only 2 million training samples.
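The similarity-based selection described in the abstract can be sketched in a few lines: score each compressed local-patch token by its cosine similarity to a pooled question embedding and keep only the top-k. This is a minimal illustration; the function name, tensor shapes, and scoring details are assumptions, not the paper's implementation.

```python
import numpy as np

def select_tokens(image_tokens, question_emb, k):
    """Keep the k image tokens most similar to the question embedding.

    Hypothetical sketch of a similarity-based selector: image_tokens is
    (num_tokens, dim), question_emb is (dim,).
    """
    # Cosine similarity between each token and the question embedding.
    norms = np.linalg.norm(image_tokens, axis=1) * np.linalg.norm(question_emb)
    sims = image_tokens @ question_emb / np.maximum(norms, 1e-8)
    topk = np.argsort(sims)[::-1][:k]       # indices of the k highest scores
    return image_tokens[np.sort(topk)]      # keep original token order

rng = np.random.default_rng(0)
tokens = rng.normal(size=(16, 8))   # 16 compressed local-patch tokens
q = rng.normal(size=8)              # pooled question embedding
kept = select_tokens(tokens, q, k=4)
print(kept.shape)  # (4, 8)
```

The 'less is more' finding suggests that `k` can be small: discarding low-similarity tokens removes distractors rather than information.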
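The alternating training the abstract advocates amounts to updating only one parameter group per step so the global mining block and the local compression block take turns. The toy loop below, using a quadratic loss and plain gradient descent, is purely illustrative; the group names and the strict every-other-step schedule are assumptions, not the paper's exact recipe.

```python
def toy_grad(x):
    # Gradient of the toy loss f(x) = x^2, minimized at 0.
    return 2 * x

def train_alternating(params, steps, lr=0.25):
    """Alternate updates between two parameter groups (hypothetical sketch)."""
    for step in range(steps):
        active = "global" if step % 2 == 0 else "local"  # blocks take turns
        params[active] -= lr * toy_grad(params[active])  # update one group only
    return params

params = {"global": 4.0, "local": 4.0}
out = train_alternating(params, steps=6)
print(out)  # {'global': 0.5, 'local': 0.5}
```

Because each group receives the same number of updates, neither the global nor the local branch dominates optimization, which is the imbalance the alternating scheme is meant to prevent.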

