

BLIP3o-NEXT: Next Frontier of Native Image Generation

October 17, 2025
Authors: Jiuhai Chen, Le Xue, Zhiyang Xu, Xichen Pan, Shusheng Yang, Can Qin, An Yan, Honglu Zhou, Zeyuan Chen, Lifu Huang, Tianyi Zhou, Junnan Li, Silvio Savarese, Caiming Xiong, Ran Xu
cs.AI

Abstract

We present BLIP3o-NEXT, a fully open-source foundation model in the BLIP3 series that advances the next frontier of native image generation. BLIP3o-NEXT unifies text-to-image generation and image editing within a single architecture, demonstrating strong image generation and image editing capabilities. In developing this state-of-the-art native image generation model, we identify four key insights: (1) Most architectural choices yield comparable performance; an architecture can be deemed effective provided it scales efficiently and supports fast inference; (2) The successful application of reinforcement learning can further push the frontier of native image generation; (3) Image editing remains a challenging task, yet instruction following and the consistency between generated and reference images can be significantly enhanced through post-training and a data engine; (4) Data quality and scale continue to be decisive factors that determine the upper bound of model performance. Building upon these insights, BLIP3o-NEXT leverages an Autoregressive + Diffusion architecture in which an autoregressive model first generates discrete image tokens conditioned on multimodal inputs, whose hidden states are then used as conditioning signals for a diffusion model to generate high-fidelity images. This architecture integrates the reasoning strength and instruction following of autoregressive models with the fine-detail rendering ability of diffusion models, achieving a new level of coherence and realism. Extensive evaluations on various text-to-image and image-editing benchmarks show that BLIP3o-NEXT achieves superior performance over existing models.
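
To make the Autoregressive + Diffusion interface described in the abstract concrete, the sketch below shows one plausible way the two components could be wired together: an autoregressive transformer predicts discrete image tokens from multimodal input tokens, and its hidden states are passed to a diffusion decoder as cross-attention conditioning. This is a minimal illustration based only on the abstract; all class names, dimensions, and the cross-attention conditioning scheme are assumptions, not the BLIP3o-NEXT implementation.

```python
# Minimal sketch (not the authors' code) of an AR + Diffusion interface:
# an autoregressive backbone emits discrete image tokens, and its hidden
# states condition a diffusion decoder. All names/sizes are illustrative.
import torch
import torch.nn as nn


class ToyAutoregressiveBackbone(nn.Module):
    """Hypothetical AR model: embeds a multimodal token sequence, predicts
    discrete image tokens, and exposes per-position hidden states."""

    def __init__(self, vocab_size=8192, d_model=512, n_layers=4, n_heads=8):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, n_layers)
        self.to_logits = nn.Linear(d_model, vocab_size)

    def forward(self, token_ids):
        hidden = self.blocks(self.embed(token_ids))   # (B, T, d_model)
        logits = self.to_logits(hidden)               # next-token logits over image tokens
        return logits, hidden


class ToyDiffusionDecoder(nn.Module):
    """Hypothetical diffusion decoder: denoises image latents while
    cross-attending to the AR hidden states as conditioning signals."""

    def __init__(self, latent_dim=64, d_model=512, n_heads=8):
        super().__init__()
        self.in_proj = nn.Linear(latent_dim, d_model)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.out_proj = nn.Linear(d_model, latent_dim)

    def forward(self, noisy_latents, ar_hidden):
        q = self.in_proj(noisy_latents)               # (B, L, d_model)
        attended, _ = self.cross_attn(q, ar_hidden, ar_hidden)
        return self.out_proj(attended)                # predicted noise / denoised latents


if __name__ == "__main__":
    ar = ToyAutoregressiveBackbone()
    decoder = ToyDiffusionDecoder()

    multimodal_tokens = torch.randint(0, 8192, (1, 32))   # placeholder prompt tokens
    _, hidden_states = ar(multimodal_tokens)

    noisy_latents = torch.randn(1, 16, 64)                # placeholder image latents
    denoised = decoder(noisy_latents, hidden_states)
    print(denoised.shape)                                 # torch.Size([1, 16, 64])
```

In this sketch the AR hidden states play the role the abstract describes: they carry the reasoning and instruction-following signal from the multimodal prompt, while the diffusion decoder is responsible for rendering fine image detail from latents.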