Feast Your Eyes: Mixture-of-Resolution Adaptation for Multimodal Large Language Models

Abstract

Despite remarkable progress, existing multimodal large language models (MLLMs) still fall short in fine-grained visual recognition. In contrast to previous work, we study this problem from the perspective of image resolution and show that combining low- and high-resolution visual features can effectively mitigate this shortcoming. Based on this observation, we propose a novel and efficient method for MLLMs, termed Mixture-of-Resolution Adaptation (MRA). In particular, MRA adopts two visual pathways for images of different resolutions, where high-resolution visual information is embedded into the low-resolution pathway via novel mixture-of-resolution adapters (MR-Adapters). This design also greatly reduces the input sequence length of MLLMs. To validate MRA, we apply it to a recent MLLM called LLaVA and term the new model LLaVA-HR. We conduct extensive experiments on 11 vision-language (VL) tasks, which show that LLaVA-HR outperforms existing MLLMs on 8 VL tasks, e.g., +9.4% on TextVQA. More importantly, both training and inference of LLaVA-HR remain efficient with MRA, e.g., 20 training hours and 3× faster inference than LLaVA-1.5. Source code is released at: https://github.com/luogen1996/LLaVA-HR.
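To make the sequence-length argument concrete, the following is a minimal sketch under stated assumptions, not the authors' implementation. It assumes a ViT-style encoder with a patch size of 14 and example resolutions of 336 and 672 pixels, none of which are given in the abstract. The `fuse` function is a hypothetical stand-in for an MR-Adapter: it simply average-pools a high-resolution feature grid down to the low-resolution grid and adds it with a fixed gate, so the number of tokens passed to the language model stays at the low-resolution count.

```python
def num_patch_tokens(image_size: int, patch_size: int = 14) -> int:
    """Visual tokens emitted by a ViT-style encoder for a square image.

    patch_size=14 is an assumption for illustration only.
    """
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    side = image_size // patch_size
    return side * side


def avg_pool_2d(grid, factor):
    """Average-pool a square 2D grid (list of lists of floats) by an integer factor."""
    out_side = len(grid) // factor
    pooled = []
    for i in range(out_side):
        row = []
        for j in range(out_side):
            block = [
                grid[i * factor + di][j * factor + dj]
                for di in range(factor)
                for dj in range(factor)
            ]
            row.append(sum(block) / len(block))
        pooled.append(row)
    return pooled


def fuse(low, high, gate=0.5):
    """Hypothetical fusion: embed pooled high-res features into the low-res grid.

    The output has the same shape as `low`, so the token count does not grow.
    """
    factor = len(high) // len(low)
    pooled = avg_pool_2d(high, factor)
    return [
        [l + gate * h for l, h in zip(low_row, pooled_row)]
        for low_row, pooled_row in zip(low, pooled)
    ]


if __name__ == "__main__":
    low_tokens = num_patch_tokens(336)   # low-resolution pathway
    high_tokens = num_patch_tokens(672)  # high-resolution pathway, if fed directly
    print(low_tokens, high_tokens)

    low = [[1.0]]
    high = [[1.0, 3.0], [5.0, 7.0]]
    print(fuse(low, high))
```

Under these assumptions, feeding the high-resolution image directly would quadruple the number of visual tokens, whereas fusing it into the low-resolution pathway keeps the sequence at the low-resolution length.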
