xGen-MM (BLIP-3): A Family of Open Large Multimodal Models
August 16, 2024
Authors: Le Xue, Manli Shu, Anas Awadalla, Jun Wang, An Yan, Senthil Purushwalkam, Honglu Zhou, Viraj Prabhu, Yutong Dai, Michael S Ryoo, Shrikant Kendre, Jieyu Zhang, Can Qin, Shu Zhang, Chia-Chih Chen, Ning Yu, Juntao Tan, Tulika Manoj Awalgaonkar, Shelby Heinecke, Huan Wang, Yejin Choi, Ludwig Schmidt, Zeyuan Chen, Silvio Savarese, Juan Carlos Niebles, Caiming Xiong, Ran Xu
cs.AI
Abstract
This report introduces xGen-MM (also known as BLIP-3), a framework for
developing Large Multimodal Models (LMMs). The framework comprises meticulously
curated datasets, a training recipe, model architectures, and a resulting suite
of LMMs. xGen-MM, short for xGen-MultiModal, expands the Salesforce xGen
initiative on foundation AI models. Our models undergo rigorous evaluation
across a range of tasks, including both single and multi-image benchmarks. Our
pre-trained base model exhibits strong in-context learning capabilities and the
instruction-tuned model demonstrates competitive performance among open-source
LMMs with similar model sizes. In addition, we introduce a safety-tuned model
with DPO, aiming to mitigate harmful behaviors such as hallucinations and
improve safety. We open-source our models, curated large-scale datasets, and
our fine-tuning codebase to facilitate further advancements in LMM research.
Associated resources will be available on our project page.