PixArt-α: Fast Training of Diffusion Transformer for Photorealistic Text-to-Image Synthesis
September 30, 2023
Authors: Junsong Chen, Jincheng Yu, Chongjian Ge, Lewei Yao, Enze Xie, Yue Wu, Zhongdao Wang, James Kwok, Ping Luo, Huchuan Lu, Zhenguo Li
cs.AI
Abstract
The most advanced text-to-image (T2I) models require significant training
costs (e.g., millions of GPU hours), seriously hindering fundamental
innovation in the AIGC community while increasing CO2 emissions. This paper
introduces PIXART-alpha, a Transformer-based T2I diffusion model whose image
generation quality is competitive with state-of-the-art image generators (e.g.,
Imagen, SDXL, and even Midjourney), reaching near-commercial application
standards. Additionally, it supports high-resolution image synthesis up to
1024px resolution with low training cost, as shown in Figures 1 and 2. To
achieve this goal, three core designs are proposed: (1) Training strategy
decomposition: We devise three distinct training steps that separately optimize
pixel dependency, text-image alignment, and image aesthetic quality; (2)
Efficient T2I Transformer: We incorporate cross-attention modules into
Diffusion Transformer (DiT) to inject text conditions and streamline the
computation-intensive class-condition branch; (3) High-informative data: We
emphasize the significance of concept density in text-image pairs and leverage
a large Vision-Language model to auto-label dense pseudo-captions to assist
text-image alignment learning. As a result, PIXART-alpha's training speed
markedly surpasses existing large-scale T2I models, e.g., PIXART-alpha only
takes 10.8% of Stable Diffusion v1.5's training time (675 vs. 6,250 A100 GPU
days), saving nearly $300,000 ($26,000 vs. $320,000) and reducing CO2
emissions by 90%. Moreover, compared with a larger SOTA model, RAPHAEL, our training
cost is merely 1%. Extensive experiments demonstrate that PIXART-alpha
excels in image quality, artistry, and semantic control. We hope
PIXART-alpha will provide new insights to the AIGC community and startups to
accelerate building their own high-quality yet low-cost generative models from
scratch.
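
To make the second core design concrete, the sketch below (not the authors' implementation) illustrates how a DiT-style transformer block can receive text conditioning through a cross-attention module while the timestep signal is applied via an adaptive-LayerNorm path, standing in for the computation-intensive class-condition branch of the original DiT. All module names, dimensions, and the layer layout are illustrative assumptions.

```python
# Minimal, illustrative sketch of a text-conditioned DiT-style block.
# Dimensions (1152 hidden, 4096 text features) and layout are assumptions.
import torch
import torch.nn as nn


class TextConditionedDiTBlock(nn.Module):
    def __init__(self, dim: int = 1152, num_heads: int = 16, text_dim: int = 4096):
        super().__init__()
        # Self-attention over latent image tokens.
        self.norm1 = nn.LayerNorm(dim, elementwise_affine=False)
        self.self_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Cross-attention that injects text-encoder features (e.g., T5 embeddings).
        self.norm2 = nn.LayerNorm(dim, elementwise_affine=False)
        self.text_proj = nn.Linear(text_dim, dim)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Feed-forward network.
        self.norm3 = nn.LayerNorm(dim, elementwise_affine=False)
        self.mlp = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )
        # Timestep conditioning via adaptive LayerNorm (scale/shift/gate),
        # replacing a separate class-condition branch.
        self.adaln = nn.Linear(dim, 6 * dim)

    def forward(self, x, text_tokens, t_emb):
        # x: (B, N, dim) latent tokens; text_tokens: (B, L, text_dim); t_emb: (B, dim)
        shift1, scale1, gate1, shift2, scale2, gate2 = self.adaln(t_emb).chunk(6, dim=-1)
        h = self.norm1(x) * (1 + scale1.unsqueeze(1)) + shift1.unsqueeze(1)
        x = x + gate1.unsqueeze(1) * self.self_attn(h, h, h, need_weights=False)[0]
        ctx = self.text_proj(text_tokens)
        x = x + self.cross_attn(self.norm2(x), ctx, ctx, need_weights=False)[0]
        h = self.norm3(x) * (1 + scale2.unsqueeze(1)) + shift2.unsqueeze(1)
        x = x + gate2.unsqueeze(1) * self.mlp(h)
        return x


# Example usage with dummy tensors:
# block = TextConditionedDiTBlock()
# out = block(torch.randn(2, 256, 1152), torch.randn(2, 77, 4096), torch.randn(2, 1152))
```

In this layout the text tokens enter each block only through cross-attention, so no per-class embedding branch is required; a full model would stack many such blocks over latent patch tokens inside the diffusion denoiser.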