PixArt-α: Fast Training of Diffusion Transformer for Photorealistic Text-to-Image Synthesis
September 30, 2023
Authors: Junsong Chen, Jincheng Yu, Chongjian Ge, Lewei Yao, Enze Xie, Yue Wu, Zhongdao Wang, James Kwok, Ping Luo, Huchuan Lu, Zhenguo Li
cs.AI
Abstract
The most advanced text-to-image (T2I) models require significant training
costs (e.g., millions of GPU hours), seriously hindering fundamental
innovation in the AIGC community while increasing CO2 emissions. This paper
introduces PIXART-alpha, a Transformer-based T2I diffusion model whose image
generation quality is competitive with state-of-the-art image generators (e.g.,
Imagen, SDXL, and even Midjourney), reaching near-commercial application
standards. Additionally, it supports high-resolution image synthesis up to
1024px resolution with low training cost, as shown in Figures 1 and 2. To
achieve this goal, three core designs are proposed: (1) Training strategy
decomposition: We devise three distinct training steps that separately optimize
pixel dependency, text-image alignment, and image aesthetic quality; (2)
Efficient T2I Transformer: We incorporate cross-attention modules into
Diffusion Transformer (DiT) to inject text conditions and streamline the
computation-intensive class-condition branch; (3) High-informative data: We
emphasize the significance of concept density in text-image pairs and leverage
a large Vision-Language model to auto-label dense pseudo-captions to assist
text-image alignment learning. As a result, PIXART-alpha's training speed
markedly surpasses existing large-scale T2I models, e.g., PIXART-alpha only
takes 10.8% of Stable Diffusion v1.5's training time (675 vs. 6,250 A100 GPU
days), saving nearly $300,000 ($26,000 vs. $320,000) and reducing CO2
emissions by 90%. Moreover, compared with a larger SOTA model, RAPHAEL, our training
cost is merely 1%. Extensive experiments demonstrate that PIXART-alpha
excels in image quality, artistry, and semantic control. We hope
PIXART-alpha will provide new insights to the AIGC community and startups to
accelerate building their own high-quality yet low-cost generative models from
scratch.
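
To make the second core design concrete, the sketch below (not the authors' implementation) illustrates how a DiT-style transformer block can receive text conditioning through a cross-attention module while the timestep signal is applied via an adaptive-LayerNorm path, standing in for the computation-intensive class-condition branch of the original DiT. All module names, dimensions, and the layer layout are illustrative assumptions.

```python
# Minimal, illustrative sketch of a text-conditioned DiT-style block.
# Dimensions (1152 hidden, 4096 text features) and layout are assumptions.
import torch
import torch.nn as nn


class TextConditionedDiTBlock(nn.Module):
    def __init__(self, dim: int = 1152, num_heads: int = 16, text_dim: int = 4096):
        super().__init__()
        # Self-attention over latent image tokens.
        self.norm1 = nn.LayerNorm(dim, elementwise_affine=False)
        self.self_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Cross-attention that injects text-encoder features (e.g., T5 embeddings).
        self.norm2 = nn.LayerNorm(dim, elementwise_affine=False)
        self.text_proj = nn.Linear(text_dim, dim)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Feed-forward network.
        self.norm3 = nn.LayerNorm(dim, elementwise_affine=False)
        self.mlp = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )
        # Timestep conditioning via adaptive LayerNorm (scale/shift/gate),
        # replacing a separate class-condition branch.
        self.adaln = nn.Linear(dim, 6 * dim)

    def forward(self, x, text_tokens, t_emb):
        # x: (B, N, dim) latent tokens; text_tokens: (B, L, text_dim); t_emb: (B, dim)
        shift1, scale1, gate1, shift2, scale2, gate2 = self.adaln(t_emb).chunk(6, dim=-1)
        h = self.norm1(x) * (1 + scale1.unsqueeze(1)) + shift1.unsqueeze(1)
        x = x + gate1.unsqueeze(1) * self.self_attn(h, h, h, need_weights=False)[0]
        ctx = self.text_proj(text_tokens)
        x = x + self.cross_attn(self.norm2(x), ctx, ctx, need_weights=False)[0]
        h = self.norm3(x) * (1 + scale2.unsqueeze(1)) + shift2.unsqueeze(1)
        x = x + gate2.unsqueeze(1) * self.mlp(h)
        return x


# Example usage with dummy tensors:
# block = TextConditionedDiTBlock()
# out = block(torch.randn(2, 256, 1152), torch.randn(2, 77, 4096), torch.randn(2, 1152))
```

In this layout the text tokens enter each block only through cross-attention, so no per-class embedding branch is required; a full model would stack many such blocks over latent patch tokens inside the diffusion denoiser.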