SPHINX: The Joint Mixing of Weights, Tasks, and Visual Embeddings for Multi-modal Large Language Models
November 13, 2023
Authors: Ziyi Lin, Chris Liu, Renrui Zhang, Peng Gao, Longtian Qiu, Han Xiao, Han Qiu, Chen Lin, Wenqi Shao, Keqin Chen, Jiaming Han, Siyuan Huang, Yichi Zhang, Xuming He, Hongsheng Li, Yu Qiao
cs.AI
Abstract
We present SPHINX, a versatile multi-modal large language model (MLLM) with a
joint mixing of model weights, tuning tasks, and visual embeddings. First, for
stronger vision-language alignment, we unfreeze the large language model (LLM)
during pre-training and introduce a weight-mixing strategy between LLMs trained
on real-world and synthetic data. By directly integrating the weights from the two
domains, the mixed LLM can efficiently incorporate diverse semantics with
favorable robustness. Then, to enable multi-purpose capabilities, we mix a
variety of tasks for joint visual instruction tuning, and design task-specific
instructions to avoid inter-task conflict. In addition to the basic visual
question answering, we include more challenging tasks such as region-level
understanding, caption grounding, document layout detection, and human pose
estimation, contributing to mutual enhancement over different scenarios.
Additionally, we propose to extract comprehensive visual embeddings from
various network architectures, pre-training paradigms, and information
granularities, providing language models with more robust image representations.
Based on our proposed joint mixing, SPHINX exhibits superior multi-modal
understanding capabilities on a wide range of applications. On top of this, we
further propose an efficient strategy aiming to better capture fine-grained
appearances of high-resolution images. With a mixing of different scales and
high-resolution sub-images, SPHINX attains exceptional visual parsing and
reasoning performance on existing evaluation benchmarks. We hope our work can
shed light on the exploration of joint mixing in future MLLM research. Code
is released at https://github.com/Alpha-VLLM/LLaMA2-Accessory.
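The abstract describes two concrete mixing mechanisms that can be sketched briefly. First, the weight mixing between LLMs tuned on real-world and synthetic data amounts to directly combining compatible parameter sets. The following minimal PyTorch sketch assumes a simple linear interpolation with a hypothetical ratio `beta`; it illustrates the idea rather than the released SPHINX implementation.

```python
# Hedged sketch: linear interpolation of two LLM checkpoints with the
# same architecture, one tuned on real-world data and one on synthetic
# data. The function name and the ratio `beta` are illustrative.
import torch

def mix_llm_weights(state_real: dict, state_synthetic: dict,
                    beta: float = 0.5) -> dict:
    """Element-wise mix of two compatible state dicts."""
    mixed = {}
    for name, w_real in state_real.items():
        w_syn = state_synthetic[name]
        # Shapes must match, i.e. both checkpoints share one architecture.
        mixed[name] = beta * w_real + (1.0 - beta) * w_syn
    return mixed

# Usage with two LLaMA-style checkpoints saved via torch.save:
# state_real = torch.load("llm_realworld.pth", map_location="cpu")
# state_syn = torch.load("llm_synthetic.pth", map_location="cpu")
# state_mixed = mix_llm_weights(state_real, state_syn, beta=0.5)
```

Second, the high-resolution strategy mixes a coarse global view with full-resolution sub-images. The sketch below assumes a 2x2 sub-image grid and a 224-pixel encoder input size; both values are assumptions for illustration, not the paper's exact configuration.

```python
# Hedged sketch: build one downsampled global view plus a grid of
# high-resolution sub-images, all resized to the visual encoder's
# expected input size.
from PIL import Image

def multi_scale_views(image: Image.Image, grid: int = 2, size: int = 224):
    """Return [global view] + grid*grid cropped sub-images."""
    views = [image.resize((size, size))]  # coarse, whole-image scale
    w, h = image.size
    sub_w, sub_h = w // grid, h // grid
    for row in range(grid):
        for col in range(grid):
            box = (col * sub_w, row * sub_h,
                   (col + 1) * sub_w, (row + 1) * sub_h)
            views.append(image.crop(box).resize((size, size)))
    return views
```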