

BLIP3-o: A Family of Fully Open Unified Multimodal Models – Architecture, Training and Dataset

May 14, 2025
Authors: Jiuhai Chen, Zhiyang Xu, Xichen Pan, Yushi Hu, Can Qin, Tom Goldstein, Lifu Huang, Tianyi Zhou, Saining Xie, Silvio Savarese, Le Xue, Caiming Xiong, Ran Xu
cs.AI

Abstract

Unifying image understanding and generation has gained growing attention in recent research on multimodal models. Although design choices for image understanding have been extensively studied, the optimal model architecture and training recipe for a unified framework with image generation remain underexplored. Motivated by the strong potential of autoregressive and diffusion models for high-quality generation and scalability, we conduct a comprehensive study of their use in unified multimodal settings, with emphasis on image representations, modeling objectives, and training strategies. Grounded in these investigations, we introduce a novel approach that employs a diffusion transformer to generate semantically rich CLIP image features, in contrast to conventional VAE-based representations. This design yields both higher training efficiency and improved generative quality. Furthermore, we demonstrate that a sequential pretraining strategy for unified models, first training on image understanding and subsequently on image generation, offers practical advantages by preserving image understanding capability while developing strong image generation ability. Finally, we carefully curate a high-quality instruction-tuning dataset, BLIP3o-60k, for image generation by prompting GPT-4o with a diverse set of captions covering various scenes, objects, human gestures, and more. Building on our innovative model design, training recipe, and datasets, we develop BLIP3-o, a suite of state-of-the-art unified multimodal models. BLIP3-o achieves superior performance across most of the popular benchmarks spanning both image understanding and generation tasks. To facilitate future research, we fully open-source our models, including code, model weights, training scripts, and pretraining and instruction-tuning datasets.
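
The central design described above, generating semantically rich CLIP image features with a diffusion transformer rather than decoding VAE latents, can be pictured with a short sketch. The PyTorch code below is a minimal, hypothetical illustration: a small denoising transformer (FeatureDiT) regresses a flow-matching velocity over a sequence of CLIP patch features, conditioned on hidden states from an autoregressive backbone. All class names, tensor shapes, dimensions, and the choice of a linear-interpolation flow-matching loss are assumptions made for illustration, not the released BLIP3-o implementation.

# Minimal sketch (PyTorch), not the released BLIP3-o code: a diffusion
# transformer trained with a flow-matching-style objective to generate
# CLIP image features, conditioned on hidden states from an autoregressive
# LLM. All names, shapes, and hyperparameters are illustrative assumptions.
import torch
import torch.nn as nn

class FeatureDiT(nn.Module):
    """Denoising transformer over a sequence of CLIP patch features."""
    def __init__(self, feat_dim=1024, cond_dim=4096, width=768, depth=6, heads=12):
        super().__init__()
        self.in_proj = nn.Linear(feat_dim, width)
        self.cond_proj = nn.Linear(cond_dim, width)
        self.time_mlp = nn.Sequential(nn.Linear(1, width), nn.SiLU(), nn.Linear(width, width))
        layer = nn.TransformerEncoderLayer(width, heads, 4 * width, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, depth)
        self.out_proj = nn.Linear(width, feat_dim)

    def forward(self, x_t, t, cond):
        # x_t: (B, N, feat_dim) noisy CLIP features; t: (B,) timesteps in [0, 1];
        # cond: (B, M, cond_dim) hidden states from the autoregressive backbone.
        h = self.in_proj(x_t) + self.time_mlp(t[:, None, None])  # broadcast timestep embedding
        h = torch.cat([self.cond_proj(cond), h], dim=1)          # prepend condition tokens
        h = self.blocks(h)
        return self.out_proj(h[:, cond.size(1):])                # predicted velocity per patch

def flow_matching_loss(model, clip_feats, cond):
    # Linear-interpolation flow matching: x_t = (1 - t) * noise + t * x1,
    # with target velocity (x1 - noise); the model regresses that velocity.
    b = clip_feats.size(0)
    t = torch.rand(b, device=clip_feats.device)
    noise = torch.randn_like(clip_feats)
    x_t = (1 - t)[:, None, None] * noise + t[:, None, None] * clip_feats
    v_target = clip_feats - noise
    return nn.functional.mse_loss(model(x_t, t, cond), v_target)

# Example usage with dummy tensors (batch of 2, 64 CLIP patches, 16 prompt tokens):
model = FeatureDiT()
feats = torch.randn(2, 64, 1024)   # target features from a frozen CLIP encoder
cond = torch.randn(2, 16, 4096)    # hidden states from the autoregressive backbone
loss = flow_matching_loss(model, feats, cond)
loss.backward()

Under this sketch, the sequential recipe in the abstract would amount to freezing the autoregressive backbone after the image-understanding stage and optimizing only the generation module, for example by building the optimizer over the FeatureDiT parameters alone, so that understanding capability is preserved while generation is learned.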
