PixArt-Σ: Weak-to-Strong Training of Diffusion Transformer for 4K Text-to-Image Generation
March 7, 2024
Authors: Junsong Chen, Chongjian Ge, Enze Xie, Yue Wu, Lewei Yao, Xiaozhe Ren, Zhongdao Wang, Ping Luo, Huchuan Lu, Zhenguo Li
cs.AI
Abstract
In this paper, we introduce PixArt-Σ, a Diffusion Transformer (DiT) model
capable of directly generating images at 4K resolution. PixArt-Σ is a
significant advance over its predecessor, PixArt-α, offering images of markedly
higher fidelity and improved alignment with text prompts. A key feature of
PixArt-Σ is its training efficiency. Leveraging the foundational pre-training
of PixArt-α, it evolves from the "weaker" baseline to a "stronger" model by
incorporating higher-quality data, a process we term "weak-to-strong training".
The advancements in PixArt-Σ are twofold: (1) High-Quality Training Data:
PixArt-Σ incorporates superior-quality image data, paired with more precise and
detailed image captions. (2) Efficient Token Compression: we propose a novel
attention module within the DiT framework that compresses both keys and values,
significantly improving efficiency and facilitating ultra-high-resolution image
generation. Thanks to these improvements, PixArt-Σ achieves superior image
quality and adherence to user prompts with a significantly smaller model size
(0.6B parameters) than existing text-to-image diffusion models such as SDXL
(2.6B parameters) and SD Cascade (5.1B parameters). Moreover, PixArt-Σ's
ability to generate 4K images supports the creation of high-resolution posters
and wallpapers, efficiently bolstering the production of high-quality visual
content in industries such as film and gaming.
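
To make the motivation for key/value compression concrete, the following is a minimal sketch in plain Python, not the paper's implementation. The abstract does not state how the compression is done, so several things here are assumptions for illustration: an 8x VAE downsampling factor, a patch size of 2, a 3840x2160 reading of "4K", and simple average pooling over consecutive tokens as a stand-in for whatever compression operator the attention module actually uses. All function names are hypothetical.

```python
import math


def token_count(height, width, vae_factor=8, patch_size=2):
    """Latent patch tokens for an image (vae_factor and patch_size are assumed values)."""
    stride = vae_factor * patch_size
    return (height // stride) * (width // stride)


def pool_tokens(tokens, ratio):
    """Average every `ratio` consecutive token vectors into one.

    Assumption: average pooling stands in for the compression operator,
    which the abstract does not specify.
    """
    pooled = []
    for start in range(0, len(tokens), ratio):
        group = tokens[start:start + ratio]
        dim = len(group[0])
        pooled.append([sum(vec[i] for vec in group) / len(group) for i in range(dim)])
    return pooled


def attention(queries, keys, values):
    """Plain scaled dot-product attention over lists of vectors."""
    scale = 1.0 / math.sqrt(len(queries[0]))
    outputs = []
    for q in queries:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
        peak = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        dim = len(values[0])
        outputs.append([sum(w * v[i] for w, v in zip(weights, values)) / total
                        for i in range(dim)])
    return outputs


def compressed_attention(queries, keys, values, ratio):
    """Attention where keys and values are compressed but queries are not."""
    return attention(queries, pool_tokens(keys, ratio), pool_tokens(values, ratio))


# Size of the attention score matrix at an assumed 4K resolution.
n_4k = token_count(2160, 3840)
full_scores = n_4k * n_4k                 # every token attends to every token
compressed_scores = n_4k * (n_4k // 4)    # keys/values reduced 4x
print(n_4k, full_scores, compressed_scores)
```

Under these assumptions, the number of query tokens (and therefore output tokens) is unchanged, while the score matrix shrinks in proportion to the compression ratio. This is why compressing only keys and values can reduce attention cost at very high resolutions without changing the output shape.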