

SPHINX-X: Scaling Data and Parameters for a Family of Multi-modal Large Language Models

February 8, 2024
Authors: Peng Gao, Renrui Zhang, Chris Liu, Longtian Qiu, Siyuan Huang, Weifeng Lin, Shitian Zhao, Shijie Geng, Ziyi Lin, Peng Jin, Kaipeng Zhang, Wenqi Shao, Chao Xu, Conghui He, Junjun He, Hao Shao, Pan Lu, Hongsheng Li, Yu Qiao
cs.AI

Abstract

We propose SPHINX-X, an extensive Multimodality Large Language Model (MLLM) series developed upon SPHINX. To improve the architecture and training efficiency, we modify the SPHINX framework by removing redundant visual encoders, bypassing fully-padded sub-images with skip tokens, and simplifying multi-stage training into a one-stage all-in-one paradigm. To fully unleash the potential of MLLMs, we assemble a comprehensive multi-domain and multimodal dataset covering publicly available resources in language, vision, and vision-language tasks. We further enrich this collection with our curated OCR-intensive and Set-of-Mark datasets, extending the diversity and generality. By training over different base LLMs including TinyLlama1.1B, InternLM2-7B, LLaMA2-13B, and Mixtral8x7B, we obtain a spectrum of MLLMs that vary in parameter size and multilingual capabilities. Comprehensive benchmarking reveals a strong correlation between multi-modal performance and the data and parameter scales. Code and models are released at https://github.com/Alpha-VLLM/LLaMA2-Accessory.
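
To make the skip-token idea concrete, below is a minimal Python sketch of how a fully-padded sub-image (produced when a high-resolution image is split into a fixed grid) could contribute a single placeholder token instead of a full block of visual tokens, shortening the sequence passed to the LLM. The names SKIP_TOKEN, encode_sub_images, and the toy encoder are hypothetical illustrations under our own assumptions, not the authors' released implementation.

```python
import numpy as np

SKIP_TOKEN = -1  # hypothetical sentinel id standing in for a learnable skip token


def is_fully_padded(sub_image: np.ndarray, pad_value: float = 0.0) -> bool:
    """Return True if every pixel of the sub-image equals the padding value."""
    return bool(np.all(sub_image == pad_value))


def encode_sub_images(sub_images, encoder, pad_value: float = 0.0):
    """Build a visual token sequence for a grid of sub-images.

    Fully-padded sub-images contribute one skip token instead of the full
    block of visual tokens produced by the encoder.
    """
    sequence = []
    for sub in sub_images:
        if is_fully_padded(sub, pad_value):
            sequence.append(SKIP_TOKEN)            # one token marks the empty slot
        else:
            sequence.extend(encoder(sub).tolist()) # normal visual tokens
    return sequence


if __name__ == "__main__":
    # Toy usage: a 2x2 grid where the last slot is pure padding.
    rng = np.random.default_rng(0)
    grid = [rng.random((4, 4)), rng.random((4, 4)), rng.random((4, 4)), np.zeros((4, 4))]
    toy_encoder = lambda img: np.arange(img.size)  # stand-in for a ViT-style encoder
    seq = encode_sub_images(grid, toy_encoder)
    print(len(seq))  # 3 * 16 visual tokens + 1 skip token = 49
```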