BLIP3o-NEXT: Next Frontier of Native Image Generation
October 17, 2025
Authors: Jiuhai Chen, Le Xue, Zhiyang Xu, Xichen Pan, Shusheng Yang, Can Qin, An Yan, Honglu Zhou, Zeyuan Chen, Lifu Huang, Tianyi Zhou, Junnan Li, Silvio Savarese, Caiming Xiong, Ran Xu
cs.AI
Abstract
We present BLIP3o-NEXT, a fully open-source foundation model in the BLIP3
series that advances the next frontier of native image generation. BLIP3o-NEXT
unifies text-to-image generation and image editing within a single
architecture, demonstrating strong image generation and image editing
capabilities. In developing this state-of-the-art native image generation model,
we identify four key insights: (1) Most architectural choices yield comparable
performance; an architecture can be deemed effective provided it scales
efficiently and supports fast inference; (2) The successful application of
reinforcement learning can further push the frontier of native image
generation; (3) Image editing remains a challenging task, yet instruction
following and the consistency between generated and reference images can be
significantly enhanced through post-training and a data engine; (4) Data quality
and scale continue to be decisive factors that determine the upper bound of
model performance. Building upon these insights, BLIP3o-NEXT leverages an
Autoregressive + Diffusion architecture in which an autoregressive model first
generates discrete image tokens conditioned on multimodal inputs, whose hidden
states are then used as conditioning signals for a diffusion model to generate
high-fidelity images. This architecture integrates the reasoning strength and
instruction following of autoregressive models with the fine-detail rendering
ability of diffusion models, achieving a new level of coherence and realism.
Extensive evaluations on various text-to-image and image-editing benchmarks
show that BLIP3o-NEXT achieves superior performance over existing models.
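
The data flow described in the abstract (an autoregressive model emits discrete image tokens, and its hidden states then condition a second model that renders the image) can be illustrated with a toy sketch. This is not the BLIP3o-NEXT implementation: every name, size, and formula below is a hypothetical stand-in chosen so the example runs with the Python standard library alone. In particular, `toy_ar_step` replaces a real transformer, and `toy_refine` is a simple iterative refinement loop standing in for a diffusion sampler, not an actual diffusion process.

```python
import math

# All constants are illustrative assumptions, not values from the paper.
VOCAB_SIZE = 16        # size of the discrete image-token codebook
HIDDEN_DIM = 4         # width of each hidden state
NUM_IMAGE_TOKENS = 6   # number of image tokens to generate
NUM_REFINE_STEPS = 5   # number of refinement steps in the second stage


def toy_ar_step(prefix_tokens):
    """Stand-in for one autoregressive decoding step.

    Returns the next discrete token and the hidden state that produced it.
    A real model would run a transformer over the multimodal prefix.
    """
    weighted_sum = sum((i + 1) * tok for i, tok in enumerate(prefix_tokens))
    hidden = [math.sin(weighted_sum + d) for d in range(HIDDEN_DIM)]
    logits = [
        sum(h * math.cos(v * (d + 1)) for d, h in enumerate(hidden))
        for v in range(VOCAB_SIZE)
    ]
    next_token = max(range(VOCAB_SIZE), key=lambda v: logits[v])  # greedy pick
    return next_token, hidden


def generate_image_tokens(prompt_tokens):
    """Stage 1: emit discrete image tokens and keep their hidden states."""
    context = list(prompt_tokens)
    image_tokens, hidden_states = [], []
    for _ in range(NUM_IMAGE_TOKENS):
        token, hidden = toy_ar_step(context)
        context.append(token)
        image_tokens.append(token)
        hidden_states.append(hidden)
    return image_tokens, hidden_states


def pool_condition(hidden_states):
    """Average the hidden states into one conditioning vector."""
    count = len(hidden_states)
    return [sum(h[d] for h in hidden_states) / count for d in range(HIDDEN_DIM)]


def toy_refine(hidden_states, initial_latent):
    """Stage 2: iteratively move a latent toward the conditioning vector."""
    condition = pool_condition(hidden_states)
    latent = list(initial_latent)
    for steps_left in range(NUM_REFINE_STEPS, 0, -1):
        # Close 1/steps_left of the remaining gap; the last step closes it fully.
        latent = [x + (c - x) / steps_left for x, c in zip(latent, condition)]
    return latent


if __name__ == "__main__":
    tokens, hiddens = generate_image_tokens([1, 2, 3])
    latent = toy_refine(hiddens, initial_latent=[0.0] * HIDDEN_DIM)
    print("image tokens:", tokens)
    print("refined latent:", latent)
```

The point of the sketch is the interface between the two stages: the second stage consumes the hidden states rather than the discrete tokens themselves, which matches the abstract's statement that the hidden states serve as conditioning signals.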