

Feast Your Eyes: Mixture-of-Resolution Adaptation for Multimodal Large Language Models

March 5, 2024
Authors: Gen Luo, Yiyi Zhou, Yuxin Zhang, Xiawu Zheng, Xiaoshuai Sun, Rongrong Ji
cs.AI

Abstract

Despite remarkable progress, existing multimodal large language models (MLLMs) still fall short in fine-grained visual recognition. In contrast to previous works, we study this problem from the perspective of image resolution, and reveal that a combination of low- and high-resolution visual features can effectively mitigate this shortcoming. Based on this observation, we propose a novel and efficient method for MLLMs, termed Mixture-of-Resolution Adaptation (MRA). In particular, MRA adopts two visual pathways for images of different resolutions, where high-resolution visual information is embedded into the low-resolution pathway via novel mixture-of-resolution adapters (MR-Adapters). This design also greatly reduces the input sequence length of MLLMs. To validate MRA, we apply it to a recent MLLM called LLaVA, and term the new model LLaVA-HR. We conduct extensive experiments on 11 vision-language (VL) tasks, which show that LLaVA-HR outperforms existing MLLMs on 8 VL tasks, e.g., +9.4% on TextVQA. More importantly, both training and inference of LLaVA-HR remain efficient with MRA, e.g., 20 training hours and 3× faster inference than LLaVA-1.5. Source code is released at: https://github.com/luogen1996/LLaVA-HR.
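The abstract describes the core mechanism: two visual pathways at different input resolutions, with MR-Adapters injecting high-resolution information into the low-resolution pathway so the token sequence stays short. Below is a minimal PyTorch sketch of that fusion pattern; all module names, shapes, and the gating scheme are illustrative assumptions, not the authors' implementation (see the linked repository for the real MR-Adapter).

```python
# Illustrative sketch of a mixture-of-resolution adapter, assuming the
# two-pathway design described in the abstract. Names and shapes are
# hypothetical, not LLaVA-HR's actual code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MRAdapterSketch(nn.Module):
    """Hypothetical adapter fusing a high-res feature map into a low-res one."""

    def __init__(self, high_dim: int, low_dim: int):
        super().__init__()
        self.proj = nn.Conv2d(high_dim, low_dim, kernel_size=1)  # align channels
        self.gate = nn.Parameter(torch.zeros(1))  # learnable fusion weight, starts at 0

    def forward(self, low_feat: torch.Tensor, high_feat: torch.Tensor) -> torch.Tensor:
        # low_feat:  (B, C_low,  h, w)  from the low-resolution pathway
        # high_feat: (B, C_high, H, W)  from the high-resolution pathway
        high = self.proj(high_feat)
        # Pool the high-res map down to the low-res spatial grid, so the fused
        # output keeps the short low-resolution token sequence.
        high = F.adaptive_avg_pool2d(high, low_feat.shape[-2:])
        return low_feat + torch.tanh(self.gate) * high


# Usage: fuse a high-resolution-pathway feature map into a low-resolution one.
adapter = MRAdapterSketch(high_dim=192, low_dim=1024)
low = torch.randn(1, 1024, 24, 24)   # e.g. ViT patch grid at low resolution
high = torch.randn(1, 192, 96, 96)   # e.g. conv feature map at high resolution
fused = adapter(low, high)           # (1, 1024, 24, 24): sequence length unchanged
print(fused.shape)
```

The property the sketch preserves is that the fused output inherits the low-resolution spatial grid, which is why the MLLM's input sequence length stays small even though high-resolution detail is incorporated.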