xGen-MM (BLIP-3): A Family of Open Large Multimodal Models

Abstract

This report introduces xGen-MM (also known as BLIP-3), a framework for developing Large Multimodal Models (LMMs). The framework comprises meticulously curated datasets, a training recipe, model architectures, and a resulting suite of LMMs. xGen-MM, short for xGen-MultiModal, expands the Salesforce xGen initiative on foundation AI models. Our models undergo rigorous evaluation across a range of tasks, including both single-image and multi-image benchmarks. Our pre-trained base model exhibits strong in-context learning capabilities, and the instruction-tuned model demonstrates competitive performance among open-source LMMs of similar size. In addition, we introduce a safety-tuned model with DPO, aiming to mitigate harmful behaviors such as hallucinations and to improve safety. We open-source our models, curated large-scale datasets, and our fine-tuning codebase to facilitate further advancements in LMM research. Associated resources will be available on our project page above.
