NaViL: Rethinking Scaling Properties of Native Multimodal Large Language Models under Data Constraints

October 9, 2025
Authors: Changyao Tian, Hao Li, Gen Luo, Xizhou Zhu, Weijie Su, Hanming Deng, Jinguo Zhu, Jie Shao, Ziran Zhu, Yunpeng Liu, Lewei Lu, Wenhai Wang, Hongsheng Li, Jifeng Dai
cs.AI

Abstract

Compositional training has been the de-facto paradigm for existing Multimodal Large Language Models (MLLMs), where pre-trained vision encoders are connected to pre-trained LLMs through continued multimodal pre-training. However, the multimodal scaling properties of this paradigm remain difficult to explore due to its separated training. In this paper, we focus on training MLLMs natively in an end-to-end manner and systematically study their design space and scaling properties under a practical data-constrained setting. Through a careful study of various design choices in MLLMs, we obtain a meta-architecture that best balances performance and training cost. We then further explore the scaling properties of native MLLMs and reveal a positively correlated scaling relationship between vision encoders and LLMs. Based on these findings, we propose a native MLLM called NaViL, together with a simple and cost-effective training recipe. Experimental results on 14 multimodal benchmarks confirm the competitive performance of NaViL against existing MLLMs. Beyond that, our findings and results provide in-depth insights for future studies of native MLLMs.