Qwen-Image Technical Report

Abstract

We present Qwen-Image, an image generation foundation model in the Qwen series that achieves significant advances in complex text rendering and precise image editing. To address the challenges of complex text rendering, we design a comprehensive data pipeline that covers large-scale data collection, filtering, annotation, synthesis, and balancing. Moreover, we adopt a progressive training strategy that moves from non-text to text rendering, progresses from simple to complex textual inputs, and gradually scales up to paragraph-level descriptions. This curriculum learning approach substantially enhances the model's native text rendering capabilities. As a result, Qwen-Image not only performs exceptionally well in alphabetic languages such as English, but also achieves remarkable progress on more challenging logographic languages such as Chinese. To enhance image editing consistency, we introduce an improved multi-task training paradigm that incorporates not only the traditional text-to-image (T2I) and text-image-to-image (TI2I) tasks but also image-to-image (I2I) reconstruction, effectively aligning the latent representations of Qwen2.5-VL and MMDiT. Furthermore, we feed the original image separately into Qwen2.5-VL and the VAE encoder to obtain semantic and reconstructive representations, respectively. This dual-encoding mechanism enables the editing module to balance preserving semantic consistency against maintaining visual fidelity. Qwen-Image achieves state-of-the-art performance in both image generation and editing across multiple benchmarks, demonstrating its strong capabilities.
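
The progressive training strategy described in the abstract can be pictured as a staged schedule. The following is a minimal sketch only, not the authors' implementation: the stage names paraphrase the abstract, while the `Stage` class, the `stage_for_step` helper, and all step boundaries are hypothetical values chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    name: str
    start_step: int  # hypothetical first training step of this stage


# Stage order follows the abstract; the step boundaries are invented
# for illustration and do not come from the report.
CURRICULUM = (
    Stage("non_text", 0),
    Stage("simple_text", 1_000),
    Stage("complex_text", 2_000),
    Stage("paragraph_level", 3_000),
)


def stage_for_step(step: int) -> Stage:
    """Return the latest curriculum stage whose start_step is <= step."""
    if step < 0:
        raise ValueError("step must be non-negative")
    current = CURRICULUM[0]
    for stage in CURRICULUM:
        if step >= stage.start_step:
            current = stage
    return current
```

In a training loop, a data sampler could call `stage_for_step(step).name` to decide which difficulty bucket to draw from; how the stages are actually scheduled and mixed in Qwen-Image is not specified in the abstract.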
