SPHINX-X: Scaling Data and Parameters for a Family of Multi-modal Large Language Models

Abstract

We propose SPHINX-X, an extensive Multimodality Large Language Model (MLLM) series developed upon SPHINX. To improve architecture and training efficiency, we modify the SPHINX framework in three ways: removing redundant visual encoders, bypassing fully-padded sub-images with skip tokens, and simplifying multi-stage training into a one-stage all-in-one paradigm. To fully unleash the potential of MLLMs, we assemble a comprehensive multi-domain and multimodal dataset covering publicly available resources in language, vision, and vision-language tasks. We further enrich this collection with our curated OCR-intensive and Set-of-Mark datasets, extending its diversity and generality. By training over different base LLMs, including TinyLlama1.1B, InternLM2-7B, LLaMA2-13B, and Mixtral8x7B, we obtain a spectrum of MLLMs that vary in parameter size and multilingual capabilities. Comprehensive benchmarking reveals a strong correlation between multi-modal performance and the data and parameter scales. Code and models are released at https://github.com/Alpha-VLLM/LLaMA2-Accessory.
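The idea of bypassing fully-padded sub-images with skip tokens can be illustrated with a minimal sketch. The abstract does not give implementation details, so everything here is an assumption made for illustration: the token names `<skip>` and `<img>`, the helper names, the stub encoder, and the choice to detect padding by comparing pixels against a pad value (which would also treat a genuinely blank sub-image as padding). It is not the authors' code.

```python
from typing import List, Sequence

SKIP_TOKEN = "<skip>"  # hypothetical placeholder name, not taken from the paper
IMG_TOKEN = "<img>"    # hypothetical stand-in for one visual token

SubImage = Sequence[Sequence[float]]


def is_fully_padded(sub_image: SubImage, pad_value: float = 0.0) -> bool:
    """Return True when every pixel of the sub-image equals the padding value."""
    return all(pixel == pad_value for row in sub_image for pixel in row)


def encode_stub(sub_image: SubImage, tokens_per_sub_image: int) -> List[str]:
    """Stand-in for a visual encoder: emits a fixed number of placeholder tokens."""
    return [IMG_TOKEN] * tokens_per_sub_image


def build_visual_sequence(
    sub_images: Sequence[SubImage],
    tokens_per_sub_image: int = 4,
    pad_value: float = 0.0,
) -> List[str]:
    """Encode real sub-images; replace fully-padded ones with a single skip token."""
    sequence: List[str] = []
    for sub in sub_images:
        if is_fully_padded(sub, pad_value):
            # One token marks the skipped position instead of a full token block.
            sequence.append(SKIP_TOKEN)
        else:
            sequence.extend(encode_stub(sub, tokens_per_sub_image))
    return sequence


if __name__ == "__main__":
    real = [[0.5, 0.1], [0.0, 0.9]]
    padded = [[0.0, 0.0], [0.0, 0.0]]
    print(build_visual_sequence([real, padded]))
```

In this toy setting, one real and one fully-padded sub-image yield five tokens rather than eight, which is the kind of sequence-length saving the skip-token mechanism is meant to provide.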
