PixArt-α: Fast Training of Diffusion Transformer for Photorealistic Text-to-Image Synthesis

September 30, 2023
Authors: Junsong Chen, Jincheng Yu, Chongjian Ge, Lewei Yao, Enze Xie, Yue Wu, Zhongdao Wang, James Kwok, Ping Luo, Huchuan Lu, Zhenguo Li
cs.AI

Abstract

The most advanced text-to-image (T2I) models require significant training costs (e.g., millions of GPU hours), seriously hindering fundamental innovation in the AIGC community while increasing CO2 emissions. This paper introduces PIXART-α, a Transformer-based T2I diffusion model whose image generation quality is competitive with state-of-the-art image generators (e.g., Imagen, SDXL, and even Midjourney), reaching near-commercial application standards. Additionally, it supports high-resolution image synthesis up to 1024px resolution with low training cost, as shown in Figures 1 and 2. To achieve this goal, three core designs are proposed: (1) Training strategy decomposition: We devise three distinct training steps that separately optimize pixel dependency, text-image alignment, and image aesthetic quality; (2) Efficient T2I Transformer: We incorporate cross-attention modules into the Diffusion Transformer (DiT) to inject text conditions and streamline the computation-intensive class-condition branch; (3) High-informative data: We emphasize the significance of concept density in text-image pairs and leverage a large vision-language model to auto-label dense pseudo-captions that assist text-image alignment learning. As a result, PIXART-α's training speed markedly surpasses existing large-scale T2I models: PIXART-α takes only 10.8% of Stable Diffusion v1.5's training time (675 vs. 6,250 A100 GPU days), saving nearly $300,000 ($26,000 vs. $320,000) and reducing CO2 emissions by 90%. Moreover, compared with a larger SOTA model, RAPHAEL, our training cost is merely 1%. Extensive experiments demonstrate that PIXART-α excels in image quality, artistry, and semantic control. We hope PIXART-α will provide new insights to the AIGC community and startups, accelerating the building of their own high-quality yet low-cost generative models from scratch.