PixArt-Σ: Weak-to-Strong Training of Diffusion Transformer for 4K Text-to-Image Generation

March 7, 2024
Authors: Junsong Chen, Chongjian Ge, Enze Xie, Yue Wu, Lewei Yao, Xiaozhe Ren, Zhongdao Wang, Ping Luo, Huchuan Lu, Zhenguo Li
cs.AI

Abstract

In this paper, we introduce PixArt-Σ, a Diffusion Transformer model (DiT) capable of directly generating images at 4K resolution. PixArt-Σ represents a significant advancement over its predecessor, PixArt-α, offering images of markedly higher fidelity and improved alignment with text prompts. A key feature of PixArt-Σ is its training efficiency. Leveraging the foundational pre-training of PixArt-α, it evolves from the 'weaker' baseline to a 'stronger' model via incorporating higher quality data, a process we term "weak-to-strong training". The advancements in PixArt-Σ are twofold: (1) High-Quality Training Data: PixArt-Σ incorporates superior-quality image data, paired with more precise and detailed image captions. (2) Efficient Token Compression: we propose a novel attention module within the DiT framework that compresses both keys and values, significantly improving efficiency and facilitating ultra-high-resolution image generation. Thanks to these improvements, PixArt-Σ achieves superior image quality and user prompt adherence capabilities with significantly smaller model size (0.6B parameters) than existing text-to-image diffusion models, such as SDXL (2.6B parameters) and SD Cascade (5.1B parameters). Moreover, PixArt-Σ's capability to generate 4K images supports the creation of high-resolution posters and wallpapers, efficiently bolstering the production of high-quality visual content in industries such as film and gaming.
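To make the idea of compressing keys and values concrete, here is a minimal pure-Python sketch. The abstract does not say how the compression is performed, so this example assumes simple average pooling over non-overlapping windows of the token grid; the function names (`avg_pool_tokens`, `kv_compressed_attention`) and the `ratio` parameter are hypothetical and are not taken from the paper or its code.

```python
import math


def avg_pool_tokens(tokens, height, width, ratio):
    """Average-pool a row-major (height * width) token grid by `ratio` per side.

    Assumption: pooling stands in for whatever compression the paper uses.
    """
    if height % ratio or width % ratio or len(tokens) != height * width:
        raise ValueError("grid must match token count and be divisible by ratio")
    dim = len(tokens[0])
    pooled = []
    for r in range(0, height, ratio):
        for c in range(0, width, ratio):
            # Collect the ratio x ratio window of tokens starting at (r, c).
            window = [tokens[(r + dr) * width + (c + dc)]
                      for dr in range(ratio) for dc in range(ratio)]
            pooled.append([sum(t[d] for t in window) / len(window)
                           for d in range(dim)])
    return pooled


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Plain scaled dot-product attention over lists of vectors."""
    scale = 1.0 / math.sqrt(len(queries[0]))
    out_dim = len(values[0])
    outputs = []
    for q in queries:
        weights = softmax([scale * sum(a * b for a, b in zip(q, k))
                           for k in keys])
        outputs.append([sum(w * v[d] for w, v in zip(weights, values))
                        for d in range(out_dim)])
    return outputs


def kv_compressed_attention(queries, keys, values, height, width, ratio=2):
    """Attention with full-resolution queries and pooled keys/values.

    Each query is compared with (height * width) / ratio**2 keys instead of
    height * width, so the score matrix shrinks by a factor of ratio**2.
    """
    small_keys = avg_pool_tokens(keys, height, width, ratio)
    small_values = avg_pool_tokens(values, height, width, ratio)
    return attention(queries, small_keys, small_values)
```

With a 4 × 4 token grid and `ratio=2`, each of the 16 queries attends to 4 pooled key/value pairs rather than 16, while the output still contains one vector per query. The saving grows with resolution, which is the property the abstract credits for making ultra-high-resolution generation feasible.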