SPHINX-X: Scaling Data and Parameters for a Family of Multi-modal Large Language Models

Abstract

We propose SPHINX-X, an extensive Multi-modal Large Language Model (MLLM) series developed upon SPHINX. To improve the architecture and training efficiency, we modify the SPHINX framework by removing redundant visual encoders, bypassing fully-padded sub-images with skip tokens, and simplifying multi-stage training into a one-stage all-in-one paradigm. To fully unleash the potential of MLLMs, we assemble a comprehensive multi-domain and multimodal dataset covering publicly available resources in language, vision, and vision-language tasks. We further enrich this collection with our curated OCR-intensive and Set-of-Mark datasets to extend its diversity and generality. By training over different base LLMs, including TinyLlama1.1B, InternLM2-7B, LLaMA2-13B, and Mixtral8x7B, we obtain a spectrum of MLLMs that vary in parameter size and multilingual capabilities. Comprehensive benchmarking reveals a strong correlation between multi-modal performance and the data and parameter scales. Code and models are released at https://github.com/Alpha-VLLM/LLaMA2-Accessory
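As a rough illustration of the skip-token idea mentioned in the abstract, the sketch below builds a placeholder token sequence for an image padded onto a square canvas and split into sub-images. It is not the authors' implementation: the function names, the `<skip>` token string, the top-left placement of the image, and the tile and token counts are all assumptions made for the example.

```python
from typing import List

SKIP = "<skip>"  # hypothetical placeholder; the real skip token is not specified here


def tile_is_fully_padded(row: int, col: int, tile: int, img_h: int, img_w: int) -> bool:
    """A tile holds only padding if it starts at or beyond the image's bottom or right edge."""
    return row * tile >= img_h or col * tile >= img_w


def build_visual_sequence(img_h: int, img_w: int, canvas: int, tile: int,
                          tokens_per_tile: int) -> List[str]:
    """Return placeholder visual tokens for an image padded onto a square canvas.

    Assumptions (not taken from the paper): the image occupies the top-left
    corner of the canvas, and the canvas side is a multiple of the tile side.
    """
    if canvas % tile != 0:
        raise ValueError("canvas must be a multiple of tile")
    if not (0 < img_h <= canvas and 0 < img_w <= canvas):
        raise ValueError("image must fit inside the canvas")

    tiles_per_side = canvas // tile
    sequence: List[str] = []
    for row in range(tiles_per_side):
        for col in range(tiles_per_side):
            if tile_is_fully_padded(row, col, tile, img_h, img_w):
                # One token stands in for the whole sub-image, so the
                # encoder and the LLM never process pure padding.
                sequence.append(SKIP)
            else:
                # Stand-in for the tokens a visual encoder would emit.
                sequence.extend([f"<tile_{row}_{col}>"] * tokens_per_tile)
    return sequence
```

For a tall image of 448 x 224 pixels on a 448 x 448 canvas with 224-pixel tiles, the two right-hand tiles contain only padding, so each collapses to a single token while the two left-hand tiles keep their full token budget.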
