TinyLLaVA: A Framework of Small-scale Large Multimodal Models
February 22, 2024
Authors: Baichuan Zhou, Ying Hu, Xi Weng, Junlong Jia, Jie Luo, Xien Liu, Ji Wu, Lei Huang
cs.AI
Abstract
We present the TinyLLaVA framework, which provides a unified perspective for
designing and analyzing small-scale Large Multimodal Models (LMMs). We
empirically study the effects of different vision encoders, connection modules,
language models, training data, and training recipes. Our extensive experiments
show that, given better-quality data combined with better training recipes,
smaller LMMs can consistently achieve performance on par with that of bigger
LMMs. Under our framework, we train a family of small-scale LMMs. Our best
model, TinyLLaVA-3.1B, achieves better overall performance than existing 7B
models such as LLaVA-1.5 and Qwen-VL. We hope our findings can serve as
baselines for future research on data scaling, training setups, and
model selection. Our model weights and code will be made public.
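
For readers unfamiliar with the LLaVA-style design the abstract refers to, the
sketch below illustrates the modular composition the framework studies: a
vision encoder, a connection module, and a small language model, each
independently swappable. This is a minimal, hypothetical PyTorch sketch of the
general architecture, not the actual TinyLLaVA implementation; all class and
parameter names are illustrative assumptions.

# Minimal sketch of a LLaVA-style small-scale LMM, assuming pretrained
# vision-encoder and language-model backbones are supplied by the caller.
# All names are illustrative, not taken from the TinyLLaVA codebase.
import torch
import torch.nn as nn


class ConnectionModule(nn.Module):
    """Two-layer MLP projecting vision features into the LLM embedding space."""

    def __init__(self, vision_dim: int, llm_dim: int):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, vision_features: torch.Tensor) -> torch.Tensor:
        return self.proj(vision_features)


class SmallScaleLMM(nn.Module):
    """Vision encoder -> connector -> language model, each swappable."""

    def __init__(self, vision_encoder: nn.Module, connector: nn.Module,
                 language_model: nn.Module):
        super().__init__()
        self.vision_encoder = vision_encoder
        self.connector = connector
        self.language_model = language_model

    def forward(self, images: torch.Tensor,
                text_embeds: torch.Tensor) -> torch.Tensor:
        # Encode images into patch features, project them into the LLM's
        # token-embedding space, then prepend them to the text embeddings
        # so the language model attends over both modalities.
        vision_features = self.vision_encoder(images)
        image_tokens = self.connector(vision_features)
        inputs = torch.cat([image_tokens, text_embeds], dim=1)
        return self.language_model(inputs)

Because each component is passed in as an ordinary nn.Module, varying the
vision encoder, connector, or language model in isolation, as the paper's
ablations do, amounts to swapping a single constructor argument.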