SPHINX: The Joint Mixing of Weights, Tasks, and Visual Embeddings for Multi-modal Large Language Models

Abstract

We present SPHINX, a versatile multi-modal large language model (MLLM) built on a joint mixing of model weights, tuning tasks, and visual embeddings. First, for stronger vision-language alignment, we unfreeze the large language model (LLM) during pre-training and introduce a weight-mix strategy between LLMs trained on real-world and synthetic data. By directly integrating the weights from the two domains, the mixed LLM efficiently incorporates diverse semantics with favorable robustness. Second, to enable multi-purpose capabilities, we mix a variety of tasks for joint visual instruction tuning and design task-specific instructions to avoid inter-task conflict. In addition to basic visual question answering, we include more challenging tasks such as region-level understanding, caption grounding, document layout detection, and human pose estimation, which contributes to mutual enhancement across scenarios. Third, we extract comprehensive visual embeddings from various network architectures, pre-training paradigms, and levels of information granularity, providing the language model with more robust image representations. Based on this joint mixing, SPHINX exhibits superior multi-modal understanding capabilities across a wide range of applications. On top of this, we propose an efficient strategy to better capture the fine-grained appearance of high-resolution images. By mixing different scales and high-resolution sub-images, SPHINX attains exceptional visual parsing and reasoning performance on existing evaluation benchmarks. We hope our work sheds light on the exploration of joint mixing in future MLLM research. Code is released at https://github.com/Alpha-VLLM/LLaMA2-Accessory.
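The abstract says the weights of two LLMs are integrated directly but does not give the formula. As a minimal sketch, assuming the mix is a plain linear interpolation, the idea might look like the following. The function name mix_weights, the coefficient beta, and the dictionary-of-lists checkpoint format are all hypothetical and chosen only for illustration; they are not taken from the released code.

```python
from typing import Dict, List

# Illustrative checkpoint format (an assumption): parameter name -> flattened values.
Weights = Dict[str, List[float]]


def mix_weights(real: Weights, synthetic: Weights, beta: float = 0.5) -> Weights:
    """Linearly interpolate two checkpoints that share the same parameter names.

    beta is a hypothetical mixing coefficient: 1.0 keeps only the weights
    trained on real-world data, 0.0 keeps only those trained on synthetic data.
    """
    if not 0.0 <= beta <= 1.0:
        raise ValueError("beta must lie in [0, 1]")
    if real.keys() != synthetic.keys():
        raise ValueError("checkpoints must share the same parameter names")

    mixed: Weights = {}
    for name, real_values in real.items():
        syn_values = synthetic[name]
        if len(real_values) != len(syn_values):
            raise ValueError(f"size mismatch for parameter {name!r}")
        # Element-wise convex combination of the two parameter vectors.
        mixed[name] = [
            beta * r + (1.0 - beta) * s for r, s in zip(real_values, syn_values)
        ]
    return mixed


if __name__ == "__main__":
    real_ckpt = {"w": [1.0, 2.0], "b": [0.0]}
    synthetic_ckpt = {"w": [3.0, 4.0], "b": [1.0]}
    print(mix_weights(real_ckpt, synthetic_ckpt, beta=0.5))
```

Interpolating in weight space only makes sense when both models share an architecture and were fine-tuned from a common starting point, which is consistent with the abstract's setting of one LLM tuned on two data domains.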
