xGen-MM (BLIP-3): A Family of Open Large Multimodal Models

Abstract

This report introduces xGen-MM (also known as BLIP-3), a framework for developing Large Multimodal Models (LMMs). The framework comprises meticulously curated datasets, a training recipe, model architectures, and a resulting suite of LMMs. xGen-MM, short for xGen-MultiModal, expands the Salesforce xGen initiative on foundation AI models. Our models undergo rigorous evaluation across a range of tasks, including both single-image and multi-image benchmarks. Our pre-trained base model exhibits strong in-context learning capabilities, and the instruction-tuned model demonstrates competitive performance among open-source LMMs of similar size. In addition, we introduce a safety-tuned model trained with DPO, aiming to mitigate harmful behaviors such as hallucinations and to improve safety. We open-source our models, curated large-scale datasets, and our fine-tuning codebase to facilitate further advancements in LMM research. Associated resources will be available on our project page above.