BLIP3o-NEXT: Next Frontier of Native Image Generation
October 17, 2025
Authors: Jiuhai Chen, Le Xue, Zhiyang Xu, Xichen Pan, Shusheng Yang, Can Qin, An Yan, Honglu Zhou, Zeyuan Chen, Lifu Huang, Tianyi Zhou, Junnan Li, Silvio Savarese, Caiming Xiong, Ran Xu
cs.AI
Abstract
We present BLIP3o-NEXT, a fully open-source foundation model in the BLIP3
series that advances the next frontier of native image generation. BLIP3o-NEXT
unifies text-to-image generation and image editing within a single
architecture, demonstrating strong image generation and image editing
capabilities. In developing this state-of-the-art native image generation model,
we identify four key insights: (1) Most architectural choices yield comparable
performance; an architecture can be deemed effective provided it scales
efficiently and supports fast inference; (2) The successful application of
reinforcement learning can further push the frontier of native image
generation; (3) Image editing remains a challenging task, yet instruction
following and the consistency between generated and reference images can be
significantly enhanced through post-training and a data engine; (4) Data quality
and scale continue to be decisive factors that determine the upper bound of
model performance. Building upon these insights, BLIP3o-NEXT leverages an
Autoregressive + Diffusion architecture in which an autoregressive model first
generates discrete image tokens conditioned on multimodal inputs, whose hidden
states are then used as conditioning signals for a diffusion model to generate
high-fidelity images. This architecture integrates the reasoning strength and
instruction following of autoregressive models with the fine-detail rendering
ability of diffusion models, achieving a new level of coherence and realism.
Extensive evaluations on various text-to-image and image-editing benchmarks
show that BLIP3o-NEXT achieves superior performance over existing models.
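To make the Autoregressive + Diffusion flow concrete, the following is a minimal, hypothetical sketch of the two-stage idea the abstract describes: an autoregressive model emits discrete image tokens and their hidden states, and a diffusion-style denoiser is conditioned on those hidden states. All class names (ToyARImageTokenizerLM, ToyConditionalDenoiser), module sizes, and the simplified greedy decoding and denoising loops are our own assumptions for illustration, not BLIP3o-NEXT's actual implementation.

```python
# Toy sketch of an Autoregressive + Diffusion pipeline (assumed structure, not the paper's code).
import torch
import torch.nn as nn

class ToyARImageTokenizerLM(nn.Module):
    """Autoregressive stage: emits discrete image tokens and their hidden states."""
    def __init__(self, vocab_size=1024, d_model=256, n_layers=2, n_heads=4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, n_layers)
        self.lm_head = nn.Linear(d_model, vocab_size)

    @torch.no_grad()
    def generate(self, prompt_ids, num_image_tokens=16):
        ids = prompt_ids
        hidden_states = []
        for _ in range(num_image_tokens):
            h = self.backbone(self.embed(ids))         # causal mask omitted in this toy version
            last_h = h[:, -1]                          # hidden state at the newest position
            next_id = self.lm_head(last_h).argmax(-1, keepdim=True)  # greedy next image token
            ids = torch.cat([ids, next_id], dim=1)
            hidden_states.append(last_h)
        return ids, torch.stack(hidden_states, dim=1)  # (B, num_image_tokens, d_model)

class ToyConditionalDenoiser(nn.Module):
    """Diffusion stage: denoiser conditioned on the AR hidden states via cross-attention."""
    def __init__(self, d_model=256, n_heads=4):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.out = nn.Linear(d_model, d_model)

    def forward(self, noisy_latents, cond_hidden_states):
        attended, _ = self.cross_attn(noisy_latents, cond_hidden_states, cond_hidden_states)
        return self.out(attended)                      # toy noise/velocity prediction

# Usage: the AR stage produces image tokens and hidden states; the diffusion stage
# refines image latents conditioned on those hidden states.
prompt_ids = torch.randint(0, 1024, (1, 8))            # placeholder multimodal prompt tokens
ar_model = ToyARImageTokenizerLM()
denoiser = ToyConditionalDenoiser()
image_token_ids, cond = ar_model.generate(prompt_ids, num_image_tokens=16)
latents = torch.randn(1, 16, 256)                      # toy image latents
for _ in range(4):                                     # drastically shortened denoising loop
    latents = latents - 0.1 * denoiser(latents, cond)
```

The sketch only illustrates the division of labor claimed in the abstract: reasoning and instruction following live in the autoregressive stage, while fine detail is rendered by the diffusion stage conditioned on the AR hidden states.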