

Qwen-Image Technical Report

August 4, 2025
作者: Chenfei Wu, Jiahao Li, Jingren Zhou, Junyang Lin, Kaiyuan Gao, Kun Yan, Sheng-ming Yin, Shuai Bai, Xiao Xu, Yilei Chen, Yuxiang Chen, Zecheng Tang, Zekai Zhang, Zhengyi Wang, An Yang, Bowen Yu, Chen Cheng, Dayiheng Liu, Deqing Li, Hang Zhang, Hao Meng, Hu Wei, Jingyuan Ni, Kai Chen, Kuan Cao, Liang Peng, Lin Qu, Minggang Wu, Peng Wang, Shuting Yu, Tingkun Wen, Wensen Feng, Xiaoxiao Xu, Yi Wang, Yichang Zhang, Yongqiang Zhu, Yujia Wu, Yuxuan Cai, Zenan Liu
cs.AI

Abstract

We present Qwen-Image, an image generation foundation model in the Qwen series that achieves significant advances in complex text rendering and precise image editing. To address the challenges of complex text rendering, we design a comprehensive data pipeline that includes large-scale data collection, filtering, annotation, synthesis, and balancing. Moreover, we adopt a progressive training strategy that starts with non-text rendering before introducing text, evolves from simple to complex textual inputs, and gradually scales up to paragraph-level descriptions. This curriculum learning approach substantially enhances the model's native text rendering capabilities. As a result, Qwen-Image not only performs exceptionally well in alphabetic languages such as English, but also achieves remarkable progress on more challenging logographic languages like Chinese. To enhance image editing consistency, we introduce an improved multi-task training paradigm that incorporates not only traditional text-to-image (T2I) and text-image-to-image (TI2I) tasks but also image-to-image (I2I) reconstruction, effectively aligning the latent representations between Qwen2.5-VL and MMDiT. Furthermore, we separately feed the original image into Qwen2.5-VL and the VAE encoder to obtain semantic and reconstructive representations, respectively. This dual-encoding mechanism enables the editing module to strike a balance between preserving semantic consistency and maintaining visual fidelity. Qwen-Image achieves state-of-the-art performance across multiple benchmarks, demonstrating its strong capabilities in both image generation and editing.
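The dual-encoding mechanism described above can be sketched as follows. This is a minimal illustrative stand-in, not the actual Qwen-Image implementation: the encoder bodies, dimensions (`SEM_DIM`), and function names (`semantic_encode`, `vae_encode`, `dual_encode`) are all hypothetical. The point it illustrates is structural — the same input image is routed down two separate branches, a semantic branch (played by Qwen2.5-VL in the paper) and a reconstructive branch (played by the VAE encoder), and the editing module conditions on both outputs.

```python
import numpy as np

SEM_DIM = 8  # hypothetical semantic-token width (stand-in for the VLM hidden size)

def semantic_encode(image: np.ndarray) -> np.ndarray:
    """Stand-in for the semantic branch (Qwen2.5-VL in the paper):
    globally pools the image and projects it into a semantic token."""
    rng = np.random.default_rng(0)
    proj = rng.standard_normal((image.shape[-1], SEM_DIM))
    pooled = image.mean(axis=(0, 1)).reshape(1, -1)  # (1, C) global pooling
    return pooled @ proj                             # (1, SEM_DIM)

def vae_encode(image: np.ndarray) -> np.ndarray:
    """Stand-in for the reconstructive branch (the VAE encoder in the paper):
    2x2 average pooling as a toy spatial downsampling into latents."""
    h, w, c = image.shape
    return image.reshape(h // 2, 2, w // 2, 2, c).mean(axis=(1, 3))  # (h/2, w/2, C)

def dual_encode(image: np.ndarray) -> dict:
    """Route the same image down both branches: the semantic features carry
    *what* is in the image (consistency), the latents carry *how* it looks
    pixel-wise (fidelity); an editing module would condition on both."""
    return {
        "semantic": semantic_encode(image),
        "reconstructive": vae_encode(image),
    }

cond = dual_encode(np.ones((4, 4, 3)))  # toy 4x4 RGB image
print(cond["semantic"].shape, cond["reconstructive"].shape)
```

Keeping the two representations separate, rather than collapsing them into one embedding, is what lets an editing module trade off semantic consistency against visual fidelity per edit.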