PixArt-α: Fast Training of Diffusion Transformer for Photorealistic Text-to-Image Synthesis

Abstract

The most advanced text-to-image (T2I) models require enormous training costs (e.g., millions of GPU hours), which seriously hinders fundamental innovation in the AIGC community and increases CO2 emissions. This paper introduces PixArt-α, a Transformer-based T2I diffusion model whose image generation quality is competitive with state-of-the-art image generators (e.g., Imagen, SDXL, and even Midjourney), approaching commercial application standards. It also supports high-resolution image synthesis up to 1024px at low training cost, as shown in Figures 1 and 2. To achieve this, we propose three core designs: (1) Training strategy decomposition: we devise three distinct training steps that separately optimize pixel dependency, text-image alignment, and image aesthetic quality; (2) Efficient T2I Transformer: we incorporate cross-attention modules into the Diffusion Transformer (DiT) to inject text conditions and streamline the computation-intensive class-condition branch; (3) Highly informative data: we emphasize the importance of concept density in text-image pairs and use a large Vision-Language model to auto-label dense pseudo-captions that assist text-image alignment learning. As a result, PixArt-α trains markedly faster than existing large-scale T2I models: it takes only 10.8% of Stable Diffusion v1.5's training time (675 vs. 6,250 A100 GPU days), saving nearly $300,000 ($26,000 vs. $320,000) and reducing CO2 emissions by 90%. Moreover, compared with a larger SOTA model, RAPHAEL, our training cost is merely 1%. Extensive experiments demonstrate that PixArt-α excels in image quality, artistry, and semantic control. We hope PixArt-α will provide new insights to the AIGC community and startups, helping them build their own high-quality yet low-cost generative models from scratch.
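The training-cost comparison with Stable Diffusion v1.5 can be checked with a few lines of arithmetic. The sketch below uses only the GPU-day and dollar figures quoted in the abstract; the variable names are illustrative and are not taken from the paper.

```python
# Figures quoted in the abstract (A100 GPU days and US dollars).
# Variable names are illustrative assumptions, not from the paper.
pixart_gpu_days = 675
sd15_gpu_days = 6250
pixart_cost_usd = 26_000
sd15_cost_usd = 320_000

# Fraction of Stable Diffusion v1.5's training time that PixArt-α needs.
time_fraction = pixart_gpu_days / sd15_gpu_days

# Absolute saving in dollars; the abstract rounds this to "nearly $300,000".
savings_usd = sd15_cost_usd - pixart_cost_usd

print(f"training time fraction: {time_fraction:.1%}")  # 10.8%
print(f"cost saving: ${savings_usd:,}")                # $294,000
```

The ratio 675 / 6,250 gives the quoted 10.8%, and the difference of $294,000 is what the abstract rounds to "nearly $300,000".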
