xGen-MM (BLIP-3): A Family of Open Large Multimodal Models

Abstract

This report introduces xGen-MM (also known as BLIP-3), a framework for developing Large Multimodal Models (LMMs). The framework comprises meticulously curated datasets, a training recipe, model architectures, and a resulting suite of LMMs. xGen-MM, short for xGen-MultiModal, expands the Salesforce xGen initiative on foundation AI models. Our models undergo rigorous evaluation across a range of tasks, including both single-image and multi-image benchmarks. Our pre-trained base model exhibits strong in-context learning capabilities, and the instruction-tuned model demonstrates competitive performance among open-source LMMs of similar size. In addition, we introduce a safety-tuned model trained with DPO, which aims to mitigate harmful behaviors such as hallucinations and to improve safety. We open-source our models, curated large-scale datasets, and fine-tuning codebase to facilitate further advances in LMM research. Associated resources will be available on our project page.
