NaViL: Rethinking Scaling Properties of Native Multimodal Large Language Models under Data Constraints
October 9, 2025
Authors: Changyao Tian, Hao Li, Gen Luo, Xizhou Zhu, Weijie Su, Hanming Deng, Jinguo Zhu, Jie Shao, Ziran Zhu, Yunpeng Liu, Lewei Lu, Wenhai Wang, Hongsheng Li, Jifeng Dai
cs.AI
Abstract
Compositional training has been the de facto paradigm for existing Multimodal Large Language Models (MLLMs), in which a pre-trained vision encoder is connected to a pre-trained LLM through continued multimodal pre-training. However, the multimodal scaling properties of this paradigm remain hard to study because of its separated training stages. In this paper, we focus on training MLLMs natively, in an end-to-end manner, and systematically study their design space and scaling properties under a practical setting, i.e., data constraints. Through a careful study of the various design choices in MLLMs, we obtain a meta-architecture that best balances performance and training cost. We then further explore the scaling properties of native MLLMs and identify a positively correlated scaling relationship between the vision encoder and the LLM. Based on these findings, we propose a native MLLM called NaViL, together with a simple and cost-effective training recipe. Experimental results on 14 multimodal benchmarks confirm the competitive performance of NaViL against existing MLLMs. Beyond that, our findings and results provide in-depth insights for future studies of native MLLMs.
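
To make the abstract's central empirical claim concrete, the sketch below shows one loose way a "positively correlated scaling relationship" between vision-encoder size and LLM size could be fitted and interpreted. Everything here is an assumption for illustration only: the joint power-law form, the synthetic data points, and the budget-constrained optimum are not NaViL's actual procedure, numbers, or findings.

```python
# Hypothetical sketch (NOT the paper's method): fit a joint power law
#   L(Nv, Nl) = c * Nv^(-a) * Nl^(-b)
# where Nv = vision-encoder parameters, Nl = LLM parameters, L = loss.
# All data points below are made up purely for illustration.
import numpy as np

# Synthetic (vision params, LLM params, validation loss) training runs.
runs = np.array([
    (0.1e9, 0.5e9, 2.90),
    (0.3e9, 0.5e9, 2.78),
    (0.3e9, 1.8e9, 2.55),
    (0.6e9, 1.8e9, 2.48),
    (0.6e9, 7.0e9, 2.21),
    (1.2e9, 7.0e9, 2.15),
])
Nv, Nl, loss = runs[:, 0], runs[:, 1], runs[:, 2]

# Linear least squares in log space:
#   log L = log c - a * log Nv - b * log Nl
A = np.stack([np.ones_like(Nv), -np.log(Nv), -np.log(Nl)], axis=1)
coef, *_ = np.linalg.lstsq(A, np.log(loss), rcond=None)
log_c, a, b = coef
print(f"fitted exponents: a (vision) = {a:.3f}, b (LLM) = {b:.3f}")

# Under this fit, minimizing loss subject to a fixed parameter budget
# Nv + Nl = N gives the Lagrange condition a/Nv = b/Nl, i.e. the optimal
# vision-encoder size grows in proportion to the LLM size (Nv/Nl = a/b).
print(f"budget-optimal size ratio Nv/Nl = {a / b:.3f}")
```

If both fitted exponents come out positive, loss keeps improving as either component grows, and the closed-form ratio Nv/Nl = a/b is one simple way to read off how the two components should scale together under a shared budget; this is meant only as intuition for the kind of relationship the paper investigates.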