TinyLLaVA: A Framework of Small-scale Large Multimodal Models

February 22, 2024
Authors: Baichuan Zhou, Ying Hu, Xi Weng, Junlong Jia, Jie Luo, Xien Liu, Ji Wu, Lei Huang
cs.AI

Abstract

We present TinyLLaVA, a framework that provides a unified perspective on designing and analyzing small-scale Large Multimodal Models (LMMs). We empirically study the effects of different vision encoders, connection modules, language models, training data, and training recipes. Our extensive experiments show that, with higher-quality data and better training recipes, smaller LMMs can consistently achieve performance on par with larger LMMs. Under our framework, we train a family of small-scale LMMs. Our best model, TinyLLaVA-3.1B, achieves better overall performance than existing 7B models such as LLaVA-1.5 and Qwen-VL. We hope our findings can serve as baselines for future research on data scaling, training setups, and model selection. Our model weights and code will be made public.
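
The abstract describes each model as a composition of three interchangeable parts: a vision encoder, a connection module, and a small language model. Below is a minimal PyTorch sketch of that composition; the class name `TinyLMM`, the MLP connector, and all dimensions are illustrative assumptions for this sketch, not the authors' released implementation.

```python
import torch
import torch.nn as nn

class TinyLMM(nn.Module):
    """Hypothetical sketch of the modular design the abstract describes:
    vision encoder -> connection module -> small language model."""

    def __init__(self, vision_encoder: nn.Module, language_model: nn.Module,
                 vision_dim: int, text_dim: int):
        super().__init__()
        self.vision_encoder = vision_encoder      # e.g. a ViT-style backbone
        # Connection module: a two-layer MLP projector (one common choice;
        # the paper studies several connector variants).
        self.connector = nn.Sequential(
            nn.Linear(vision_dim, text_dim),
            nn.GELU(),
            nn.Linear(text_dim, text_dim),
        )
        self.language_model = language_model      # small decoder-only LM

    def forward(self, image: torch.Tensor, text_embeds: torch.Tensor):
        # Encode the image into a sequence of patch features: (B, N, vision_dim).
        vision_feats = self.vision_encoder(image)
        # Project vision features into the LM's embedding space: (B, N, text_dim).
        vision_tokens = self.connector(vision_feats)
        # Prepend projected image tokens to the embedded text tokens.
        inputs = torch.cat([vision_tokens, text_embeds], dim=1)
        return self.language_model(inputs)

# Toy usage with stand-in modules (shapes only; no pretrained weights).
vision = nn.Identity()                 # pretend patch features are pre-encoded
lm = nn.Linear(2048, 32000)            # stand-in for a tiny LM
model = TinyLMM(vision, lm, vision_dim=768, text_dim=2048)
img_feats = torch.randn(1, 196, 768)   # 14x14 grid of patch features
txt = torch.randn(1, 12, 2048)         # embedded prompt tokens
logits = model(img_feats, txt)         # shape: (1, 208, 32000)
```

Because each of the three parts is passed in as a module, swapping vision encoders, connectors, or language models, as the paper's ablations do, only changes the constructor arguments, not the forward pass.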