SPHINX: The Joint Mixing of Weights, Tasks, and Visual Embeddings for Multi-modal Large Language Models

Abstract

We present SPHINX, a versatile multi-modal large language model (MLLM) built on a joint mixing of model weights, tuning tasks, and visual embeddings. First, for stronger vision-language alignment, we unfreeze the large language model (LLM) during pre-training and introduce a weight-mixing strategy between LLMs trained on real-world and synthetic data. By directly integrating the weights from the two domains, the mixed LLM can efficiently incorporate diverse semantics with favorable robustness. Then, to enable multi-purpose capabilities, we mix a variety of tasks for joint visual instruction tuning and design task-specific instructions to avoid inter-task conflict. In addition to basic visual question answering, we include more challenging tasks such as region-level understanding, caption grounding, document layout detection, and human pose estimation, which contribute to mutual enhancement across different scenarios. Additionally, we propose to extract comprehensive visual embeddings from various network architectures, pre-training paradigms, and levels of information granularity, providing language models with more robust image representations. Based on our proposed joint mixing, SPHINX exhibits superior multi-modal understanding capabilities across a wide range of applications. On top of this, we further propose an efficient strategy to better capture the fine-grained appearance of high-resolution images. By mixing different scales and high-resolution sub-images, SPHINX attains exceptional visual parsing and reasoning performance on existing evaluation benchmarks. We hope our work sheds light on the exploration of joint mixing in future MLLM research. Code is released at https://github.com/Alpha-VLLM/LLaMA2-Accessory.
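The weight-mixing step described in the abstract can be illustrated with a minimal sketch. It assumes that "directly integrating the weights from two domains" means an element-wise linear interpolation of two parameter sets with identical names and shapes. The function name `mix_weights`, the coefficient `beta`, and the toy parameter dictionaries are hypothetical and chosen only for illustration; they are not taken from the released code.

```python
from typing import Dict, List

# A toy stand-in for a model's parameters: name -> flat list of floats.
Weights = Dict[str, List[float]]


def mix_weights(real: Weights, synthetic: Weights, beta: float = 0.5) -> Weights:
    """Linearly interpolate two parameter sets (assumed mixing rule).

    mixed = beta * real + (1 - beta) * synthetic, applied element-wise.
    """
    if not 0.0 <= beta <= 1.0:
        raise ValueError("beta must lie in [0, 1]")
    if real.keys() != synthetic.keys():
        raise ValueError("both models must expose the same parameter names")

    mixed: Weights = {}
    for name, real_values in real.items():
        synthetic_values = synthetic[name]
        if len(real_values) != len(synthetic_values):
            raise ValueError(f"shape mismatch for parameter {name!r}")
        mixed[name] = [
            beta * r + (1.0 - beta) * s
            for r, s in zip(real_values, synthetic_values)
        ]
    return mixed


# Hypothetical weights of an LLM tuned on real-world data and one tuned on synthetic data.
real = {"layer.weight": [1.0, 2.0], "layer.bias": [0.0]}
synthetic = {"layer.weight": [3.0, 6.0], "layer.bias": [1.0]}

mixed = mix_weights(real, synthetic, beta=0.5)
print(mixed)
```

In a real setting the same loop would run over the tensors of two checkpoints that share an architecture. The appeal of this kind of mixing is that it needs no additional training pass, only both sets of weights.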
