TinyLLaVA: A Framework of Small-scale Large Multimodal Models

Abstract

We present the TinyLLaVA framework, which provides a unified perspective for designing and analyzing small-scale Large Multimodal Models (LMMs). We empirically study the effects of different vision encoders, connection modules, language models, training data, and training recipes. Our extensive experiments show that, when better-quality data is combined with better training recipes, smaller LMMs can consistently achieve performance on par with larger LMMs. Under our framework, we train a family of small-scale LMMs. Our best model, TinyLLaVA-3.1B, achieves better overall performance than existing 7B models such as LLaVA-1.5 and Qwen-VL. We hope our findings can serve as baselines for future research on data scaling, training setups, and model selection. Our model weights and code will be made public.
