SPHINX: The Joint Mixing of Weights, Tasks, and Visual Embeddings for Multi-modal Large Language Models
November 13, 2023
Authors: Ziyi Lin, Chris Liu, Renrui Zhang, Peng Gao, Longtian Qiu, Han Xiao, Han Qiu, Chen Lin, Wenqi Shao, Keqin Chen, Jiaming Han, Siyuan Huang, Yichi Zhang, Xuming He, Hongsheng Li, Yu Qiao
cs.AI
Abstract
We present SPHINX, a versatile multi-modal large language model (MLLM) with a
joint mixing of model weights, tuning tasks, and visual embeddings. First, for
stronger vision-language alignment, we unfreeze the large language model (LLM)
during pre-training, and introduce a weight-mixing strategy between LLMs trained
on real-world and synthetic data. By directly integrating the weights from two
domains, the mixed LLM can efficiently incorporate diverse semantics with
favorable robustness. Then, to enable multi-purpose capabilities, we mix a
variety of tasks for joint visual instruction tuning, and design task-specific
instructions to avoid inter-task conflict. In addition to the basic visual
question answering, we include more challenging tasks such as region-level
understanding, caption grounding, document layout detection, and human pose
estimation, contributing to mutual enhancement over different scenarios.
Additionally, we propose to extract comprehensive visual embeddings from
various network architectures, pre-training paradigms, and levels of information
granularity, providing language models with more robust image representations.
Based on our proposed joint mixing, SPHINX exhibits superior multi-modal
understanding capabilities on a wide range of applications. On top of this, we
further propose an efficient strategy aiming to better capture fine-grained
appearances of high-resolution images. By mixing different scales and
high-resolution sub-images, SPHINX attains exceptional visual parsing and
reasoning performance on existing evaluation benchmarks. We hope our work may
shed light on the exploration of joint mixing in future MLLM research. Code
is released at https://github.com/Alpha-VLLM/LLaMA2-Accessory.
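
The weight-mixing idea described in the abstract can be pictured as a simple interpolation between two LLM checkpoints, one fine-tuned on real-world data and one on synthetic data. The sketch below is only a minimal illustration of that idea; the interpolation ratio `beta`, the checkpoint paths, and the flat state-dict layout are assumptions, not details taken from the paper or its released code.

```python
# Hypothetical sketch of weight mixing: linearly interpolating the parameters
# of two LLM checkpoints from different data domains. The ratio `beta` and the
# file names are illustrative placeholders.
import torch

def mix_llm_weights(real_ckpt_path: str, synth_ckpt_path: str, beta: float = 0.5) -> dict:
    """Return a state dict with parameters beta * real + (1 - beta) * synthetic."""
    real_state = torch.load(real_ckpt_path, map_location="cpu")
    synth_state = torch.load(synth_ckpt_path, map_location="cpu")
    mixed_state = {}
    for name, real_param in real_state.items():
        synth_param = synth_state[name]
        mixed_state[name] = beta * real_param + (1.0 - beta) * synth_param
    return mixed_state

# Usage (paths are placeholders):
# mixed = mix_llm_weights("llm_real.pth", "llm_synthetic.pth", beta=0.5)
# model.load_state_dict(mixed)
```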
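The abstract also describes mixing visual embeddings from encoders with different architectures and pre-training paradigms. One plausible reading, sketched below under assumptions, is to project each encoder's output tokens into the LLM's embedding space and concatenate them; the encoder list, dimensions, and projection layers here are illustrative, not the paper's actual design.

```python
# A hedged sketch of combining visual embeddings from several image encoders
# (e.g. a ViT-style and a ConvNet-style backbone). All module choices are
# assumptions made for illustration.
import torch
import torch.nn as nn

class MixedVisualEmbedder(nn.Module):
    def __init__(self, encoders: list, encoder_dims: list, llm_dim: int):
        super().__init__()
        self.encoders = nn.ModuleList(encoders)
        # One linear projection per encoder, mapping its tokens into the LLM space.
        self.projections = nn.ModuleList(
            [nn.Linear(dim, llm_dim) for dim in encoder_dims]
        )

    def forward(self, image: torch.Tensor) -> torch.Tensor:
        # Each encoder returns (batch, num_tokens_i, dim_i); project each group
        # and concatenate along the token dimension so the LLM sees all views.
        token_groups = [
            proj(enc(image)) for enc, proj in zip(self.encoders, self.projections)
        ]
        return torch.cat(token_groups, dim=1)
```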
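Finally, the high-resolution strategy of mixing different scales and sub-images can be sketched as encoding one downsampled global view together with several local crops. The resolutions (448 split into four 224 crops), the crop layout, and the `visual_encoder` callable below are assumptions for illustration only.

```python
# Minimal sketch of multi-scale, sub-image encoding of a high-resolution input.
# Sizes and the encoder interface are illustrative assumptions.
import torch
import torch.nn.functional as F

def build_multiscale_tokens(image: torch.Tensor, visual_encoder) -> torch.Tensor:
    """image: (3, 448, 448) high-resolution input -> concatenated visual tokens."""
    # Global view: downsample the whole image to the encoder's base resolution.
    global_view = F.interpolate(
        image.unsqueeze(0), size=(224, 224), mode="bilinear", align_corners=False
    )
    # Local views: split the high-resolution image into four non-overlapping crops.
    crops = [
        image[:, y:y + 224, x:x + 224].unsqueeze(0)
        for y in (0, 224) for x in (0, 224)
    ]
    views = torch.cat([global_view] + crops, dim=0)   # (5, 3, 224, 224)
    tokens = visual_encoder(views)                    # assumed (5, num_tokens, dim)
    return tokens.flatten(0, 1)                       # (5 * num_tokens, dim)
```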