Qwen-Image Technical Report
August 4, 2025
Authors: Chenfei Wu, Jiahao Li, Jingren Zhou, Junyang Lin, Kaiyuan Gao, Kun Yan, Sheng-ming Yin, Shuai Bai, Xiao Xu, Yilei Chen, Yuxiang Chen, Zecheng Tang, Zekai Zhang, Zhengyi Wang, An Yang, Bowen Yu, Chen Cheng, Dayiheng Liu, Deqing Li, Hang Zhang, Hao Meng, Hu Wei, Jingyuan Ni, Kai Chen, Kuan Cao, Liang Peng, Lin Qu, Minggang Wu, Peng Wang, Shuting Yu, Tingkun Wen, Wensen Feng, Xiaoxiao Xu, Yi Wang, Yichang Zhang, Yongqiang Zhu, Yujia Wu, Yuxuan Cai, Zenan Liu
cs.AI
Abstract
We present Qwen-Image, an image generation foundation model in the Qwen
series that achieves significant advances in complex text rendering and precise
image editing. To address the challenges of complex text rendering, we design a
comprehensive data pipeline that includes large-scale data collection,
filtering, annotation, synthesis, and balancing. Moreover, we adopt a
progressive training strategy that begins with non-text rendering, then introduces text rendering,
evolves from simple to complex textual inputs, and gradually scales up to
paragraph-level descriptions. This curriculum learning approach substantially
enhances the model's native text rendering capabilities. As a result,
Qwen-Image not only performs exceptionally well in alphabetic languages such as
English, but also achieves remarkable progress on more challenging logographic
languages like Chinese. To enhance image editing consistency, we introduce an
improved multi-task training paradigm that incorporates not only traditional
text-to-image (T2I) and text-image-to-image (TI2I) tasks but also
image-to-image (I2I) reconstruction, effectively aligning the latent
representations between Qwen2.5-VL and MMDiT. Furthermore, we separately feed
the original image into Qwen2.5-VL and the VAE encoder to obtain semantic and
reconstructive representations, respectively. This dual-encoding mechanism
enables the editing module to strike a balance between preserving semantic
consistency and maintaining visual fidelity. Qwen-Image achieves
state-of-the-art performance across multiple benchmarks, demonstrating its
strong capabilities in both image generation and editing.
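
To make the curriculum concrete, here is a minimal sketch of the staged data schedule described above, progressing from non-text images to paragraph-level text rendering. The stage names, the four-way split, and the progress boundaries are illustrative assumptions, not the paper's published training recipe.

```python
# Hypothetical curriculum schedule for progressive text-rendering training.
# Stage names and boundaries are assumptions for illustration only.
STAGES = [
    ("non_text", 0.00),         # images with no rendered text
    ("simple_text", 0.25),      # single words / short lines
    ("complex_text", 0.55),     # multi-line, mixed-language layouts
    ("paragraph_level", 0.80),  # dense, paragraph-level descriptions
]

def stage_for(progress: float) -> str:
    """Return the active data bucket for training progress in [0, 1]."""
    active = STAGES[0][0]
    for name, start in STAGES:
        if progress >= start:
            active = name
    return active

for step in (0, 30, 60, 90):
    print(step, stage_for(step / 100))  # 0 non_text ... 90 paragraph_level
```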
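
The dual-encoding mechanism can be pictured as two parallel conditioning streams over the input image: a semantic stream from the vision-language model and a reconstructive stream from the VAE. The PyTorch sketch below is a simplified illustration under stated assumptions: the projection layers, the 3584-dim VLM token size, the 16-channel VAE latent, and the module names are stand-ins, and the MMDiT backbone is reduced to consuming a concatenated conditioning sequence.

```python
# Simplified sketch of dual encoding for editing: the same image is encoded
# once for semantics (VLM tokens) and once for appearance (VAE latents).
# All dimensions and module names are illustrative assumptions.
import torch
import torch.nn as nn

class DualEncodingConditioner(nn.Module):
    def __init__(self, semantic_dim=3584, latent_channels=16, model_dim=1024):
        super().__init__()
        self.semantic_proj = nn.Linear(semantic_dim, model_dim)  # VLM -> backbone width
        self.latent_proj = nn.Conv2d(latent_channels, model_dim, kernel_size=1)

    def forward(self, vl_tokens, vae_latent):
        # vl_tokens: (B, N, semantic_dim) tokens from the VLM; they anchor
        #   *what* the image depicts (semantic consistency).
        # vae_latent: (B, C, H, W) latent from the VAE encoder; it anchors
        #   *how* the image looks (visual fidelity).
        sem = self.semantic_proj(vl_tokens)               # (B, N, D)
        rec = self.latent_proj(vae_latent).flatten(2).mT  # (B, H*W, D)
        # A diffusion backbone such as MMDiT would attend over both streams.
        return torch.cat([sem, rec], dim=1)

cond = DualEncodingConditioner()(
    torch.randn(1, 77, 3584), torch.randn(1, 16, 64, 64)
)
print(cond.shape)  # torch.Size([1, 4173, 1024])
```

Keeping the two streams separate until they reach the backbone is what lets the editing module trade off semantic consistency (carried by the VLM tokens) against visual fidelity (carried by the VAE latents), which is the balance the abstract attributes to dual encoding.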