BLIP3-o: A Family of Fully Open Unified Multimodal Models-Architecture, Training and Dataset
May 14, 2025
Authors: Jiuhai Chen, Zhiyang Xu, Xichen Pan, Yushi Hu, Can Qin, Tom Goldstein, Lifu Huang, Tianyi Zhou, Saining Xie, Silvio Savarese, Le Xue, Caiming Xiong, Ran Xu
cs.AI
Abstract
Unifying image understanding and generation has gained growing attention in
recent research on multimodal models. Although design choices for image
understanding have been extensively studied, the optimal model architecture and
training recipe for a unified framework with image generation remain
underexplored. Motivated by the strong potential of autoregressive and
diffusion models for high-quality generation and scalability, we conduct a
comprehensive study of their use in unified multimodal settings, with emphasis
on image representations, modeling objectives, and training strategies.
Grounded in these investigations, we introduce a novel approach that employs a
diffusion transformer to generate semantically rich CLIP image features, in
contrast to conventional VAE-based representations. This design yields both
higher training efficiency and improved generative quality. Furthermore, we
demonstrate that a sequential pretraining strategy for unified models, first
training on image understanding and subsequently on image generation, offers
practical advantages by preserving image understanding capability while
developing strong image generation ability. Finally, we carefully curate a
high-quality instruction-tuning dataset BLIP3o-60k for image generation by
prompting GPT-4o with a diverse set of captions covering various scenes,
objects, human gestures, and more. Building on our innovative model design,
training recipe, and datasets, we develop BLIP3-o, a suite of state-of-the-art
unified multimodal models. BLIP3-o achieves superior performance across most of
the popular benchmarks spanning both image understanding and generation tasks.
To facilitate future research, we fully open-source our models, including code,
model weights, training scripts, and pretraining and instruction tuning
datasets.
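Below is a minimal, self-contained PyTorch sketch of the core design the abstract describes: a small diffusion transformer that denoises CLIP image-feature tokens conditioned on hidden states from an autoregressive backbone, rather than decoding VAE latents. The abstract does not spell out the training objective, so a rectified-flow (flow-matching) loss is assumed here; the module names (`CLIPFeatureDiT`, `flow_matching_loss`) and all dimensions are illustrative placeholders, not the released implementation.

```python
import torch
import torch.nn as nn

class CLIPFeatureDiT(nn.Module):
    """Toy diffusion transformer that denoises CLIP-feature tokens,
    conditioned on LLM hidden states by prepending them to the sequence."""
    def __init__(self, dim=1024, depth=4, heads=8):
        super().__init__()
        self.time_mlp = nn.Sequential(nn.Linear(1, dim), nn.SiLU(), nn.Linear(dim, dim))
        layer = nn.TransformerEncoderLayer(dim, heads, 4 * dim, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, depth)
        self.out = nn.Linear(dim, dim)

    def forward(self, x_t, t, cond):
        # x_t: noisy CLIP tokens (B, N, D); t: (B,); cond: LLM states (B, M, D)
        temb = self.time_mlp(t[:, None])[:, None, :]      # (B, 1, D) time embedding
        h = torch.cat([cond, x_t + temb], dim=1)          # prepend conditioning tokens
        h = self.blocks(h)
        return self.out(h[:, cond.size(1):])              # velocity for the N targets

def flow_matching_loss(model, clip_feats, cond):
    # Rectified-flow objective (assumed): interpolate noise -> CLIP features
    # at a random time t and regress the constant velocity (x1 - x0).
    x1, x0 = clip_feats, torch.randn_like(clip_feats)
    t = torch.rand(x1.size(0), device=x1.device)
    x_t = (1 - t)[:, None, None] * x0 + t[:, None, None] * x1
    return (model(x_t, t, cond) - (x1 - x0)).pow(2).mean()

# Usage with random tensors standing in for CLIP features and LLM states:
model = CLIPFeatureDiT()
clip_feats = torch.randn(2, 64, 1024)   # target CLIP image tokens (B, N, D)
cond = torch.randn(2, 16, 1024)         # conditioning states from the LLM (B, M, D)
loss = flow_matching_loss(model, clip_feats, cond)
loss.backward()
```

Predicting compact, semantically aligned CLIP tokens instead of pixel-space VAE latents is the design choice the abstract credits for both the higher training efficiency and the improved generative quality.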
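The sequential pretraining recipe can likewise be sketched schematically. One plausible reading, assumed here rather than stated in the abstract, is that the second stage freezes the understanding backbone and updates only the generation head, which is how image understanding would be preserved while generation ability is developed; the components, shapes, and simplified regression loss below are hypothetical stand-ins.

```python
import torch
import torch.nn as nn

# Placeholder stand-ins for the real components used in the sketch above.
llm_backbone = nn.Embedding(32000, 1024)   # hypothetical understanding backbone
gen_head = nn.Linear(1024, 1024)           # hypothetical CLIP-feature generation head

# Stage 1 (not shown): train the backbone on image-understanding data.
# Stage 2: freeze the backbone and update only the generation head, so the
# capability learned in stage 1 is left untouched.
for p in llm_backbone.parameters():
    p.requires_grad = False

opt = torch.optim.AdamW(gen_head.parameters(), lr=1e-4)
for step in range(3):                       # stand-in for a real data loader
    prompt_ids = torch.randint(0, 32000, (2, 16))
    clip_feats = torch.randn(2, 16, 1024)   # target CLIP tokens for these prompts
    cond = llm_backbone(prompt_ids)         # (B, M, D) conditioning states
    loss = (gen_head(cond) - clip_feats).pow(2).mean()  # simplified surrogate loss
    opt.zero_grad(); loss.backward(); opt.step()
```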