Qwen-Image Technical Report
August 4, 2025
Authors: Chenfei Wu, Jiahao Li, Jingren Zhou, Junyang Lin, Kaiyuan Gao, Kun Yan, Sheng-ming Yin, Shuai Bai, Xiao Xu, Yilei Chen, Yuxiang Chen, Zecheng Tang, Zekai Zhang, Zhengyi Wang, An Yang, Bowen Yu, Chen Cheng, Dayiheng Liu, Deqing Li, Hang Zhang, Hao Meng, Hu Wei, Jingyuan Ni, Kai Chen, Kuan Cao, Liang Peng, Lin Qu, Minggang Wu, Peng Wang, Shuting Yu, Tingkun Wen, Wensen Feng, Xiaoxiao Xu, Yi Wang, Yichang Zhang, Yongqiang Zhu, Yujia Wu, Yuxuan Cai, Zenan Liu
cs.AI
Abstract
We present Qwen-Image, an image generation foundation model in the Qwen
series that achieves significant advances in complex text rendering and precise
image editing. To address the challenges of complex text rendering, we design a
comprehensive data pipeline that includes large-scale data collection,
filtering, annotation, synthesis, and balancing. Moreover, we adopt a
progressive training strategy that begins with non-text rendering, then introduces text rendering,
evolves from simple to complex textual inputs, and gradually scales up to
paragraph-level descriptions. This curriculum learning approach substantially
enhances the model's native text rendering capabilities. As a result,
Qwen-Image not only performs exceptionally well in alphabetic languages such as
English, but also achieves remarkable progress on more challenging logographic
languages like Chinese. To enhance image editing consistency, we introduce an
improved multi-task training paradigm that incorporates not only traditional
text-to-image (T2I) and text-image-to-image (TI2I) tasks but also
image-to-image (I2I) reconstruction, effectively aligning the latent
representations between Qwen2.5-VL and MMDiT. Furthermore, we separately feed
the original image into Qwen2.5-VL and the VAE encoder to obtain semantic and
reconstructive representations, respectively. This dual-encoding mechanism
enables the editing module to strike a balance between preserving semantic
consistency and maintaining visual fidelity. Qwen-Image achieves
state-of-the-art performance across multiple benchmarks, demonstrating its
strong capabilities in both image generation and editing.
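
To make the curriculum concrete, here is a minimal sketch of the staged data schedule described above, progressing from non-text images to paragraph-level text rendering. The stage names, the four-way split, and the progress boundaries are illustrative assumptions, not the paper's published training recipe.

```python
# Hypothetical curriculum schedule for progressive text-rendering training.
# Stage names and boundaries are assumptions for illustration only.
STAGES = [
    ("non_text", 0.00),         # images with no rendered text
    ("simple_text", 0.25),      # single words / short lines
    ("complex_text", 0.55),     # multi-line, mixed-language layouts
    ("paragraph_level", 0.80),  # dense, paragraph-level descriptions
]

def stage_for(progress: float) -> str:
    """Return the active data bucket for training progress in [0, 1]."""
    active = STAGES[0][0]
    for name, start in STAGES:
        if progress >= start:
            active = name
    return active

for step in (0, 30, 60, 90):
    print(step, stage_for(step / 100))  # 0 non_text ... 90 paragraph_level
```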
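
The dual-encoding mechanism can be pictured as two parallel conditioning streams over the input image: a semantic stream from the vision-language model and a reconstructive stream from the VAE. The PyTorch sketch below is a simplified illustration under stated assumptions: the projection layers, the 3584-dim VLM token size, the 16-channel VAE latent, and the module names are stand-ins, and the MMDiT backbone is reduced to consuming a concatenated conditioning sequence.

```python
# Simplified sketch of dual encoding for editing: the same image is encoded
# once for semantics (VLM tokens) and once for appearance (VAE latents).
# All dimensions and module names are illustrative assumptions.
import torch
import torch.nn as nn

class DualEncodingConditioner(nn.Module):
    def __init__(self, semantic_dim=3584, latent_channels=16, model_dim=1024):
        super().__init__()
        self.semantic_proj = nn.Linear(semantic_dim, model_dim)  # VLM -> backbone width
        self.latent_proj = nn.Conv2d(latent_channels, model_dim, kernel_size=1)

    def forward(self, vl_tokens, vae_latent):
        # vl_tokens: (B, N, semantic_dim) tokens from the VLM; they anchor
        #   *what* the image depicts (semantic consistency).
        # vae_latent: (B, C, H, W) latent from the VAE encoder; it anchors
        #   *how* the image looks (visual fidelity).
        sem = self.semantic_proj(vl_tokens)               # (B, N, D)
        rec = self.latent_proj(vae_latent).flatten(2).mT  # (B, H*W, D)
        # A diffusion backbone such as MMDiT would attend over both streams.
        return torch.cat([sem, rec], dim=1)

cond = DualEncodingConditioner()(
    torch.randn(1, 77, 3584), torch.randn(1, 16, 64, 64)
)
print(cond.shape)  # torch.Size([1, 4173, 1024])
```

Keeping the two streams separate until they reach the backbone is what lets the editing module trade off semantic consistency (carried by the VLM tokens) against visual fidelity (carried by the VAE latents), which is the balance the abstract attributes to dual encoding.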