An Empirical Study of Scaling Instruct-Tuned Large Multimodal Models

Abstract

Visual instruction tuning has recently shown encouraging progress with open-source large multimodal models (LMMs) such as LLaVA and MiniGPT-4. However, most existing studies of open-source LMMs use models with 13B parameters or fewer. In this paper, we present an empirical study of scaling LLaVA up to 33B and 65B/70B parameters, and share findings from our explorations of image resolution, data mixing, and parameter-efficient training methods such as LoRA/QLoRA. We evaluate these factors by their impact on multimodal and language capabilities when completing real-world tasks in the wild. We find that scaling LMMs consistently enhances model performance and improves language capabilities, and that LoRA/QLoRA tuning of LMMs performs comparably to full-model fine-tuning. The study also highlights the importance of higher image resolution and of mixing multimodal and language data for improving LMM performance, and shows that visual instruction tuning can sometimes improve an LMM's pure language capability. We hope that this study makes state-of-the-art LMM research at a larger scale more accessible, thus helping establish stronger baselines for future research. Code and checkpoints will be made public.

