PixArt-Σ: Weak-to-Strong Training of Diffusion Transformer for 4K Text-to-Image Generation
March 7, 2024
Authors: Junsong Chen, Chongjian Ge, Enze Xie, Yue Wu, Lewei Yao, Xiaozhe Ren, Zhongdao Wang, Ping Luo, Huchuan Lu, Zhenguo Li
cs.AI
Abstract
In this paper, we introduce PixArt-Σ, a Diffusion Transformer (DiT) model
capable of directly generating images at 4K resolution. PixArt-Σ represents
a significant advancement over its predecessor, PixArt-α, offering images of
markedly higher fidelity and improved alignment with text prompts. A key
feature of PixArt-Σ is its training efficiency. Leveraging the foundational
pre-training of PixArt-α, it evolves from the "weaker" baseline to a
"stronger" model by incorporating higher-quality data, a process we term
"weak-to-strong training". The advancements in PixArt-Σ are twofold:
(1) High-Quality Training Data: PixArt-Σ incorporates superior-quality image
data, paired with more precise and detailed image captions. (2) Efficient
Token Compression: we propose a novel attention module within the DiT
framework that compresses both keys and values, significantly improving
efficiency and facilitating ultra-high-resolution image generation. Thanks to
these improvements, PixArt-Σ achieves superior image quality and prompt
adherence with a significantly smaller model (0.6B parameters) than existing
text-to-image diffusion models such as SDXL (2.6B parameters) and SD Cascade
(5.1B parameters). Moreover, PixArt-Σ's capability to generate 4K images
supports the creation of high-resolution posters and wallpapers, efficiently
bolstering the production of high-quality visual content in industries such
as film and gaming.
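
The key/value compression mentioned in point (2) can be illustrated with a minimal sketch. The abstract states only that keys and values are compressed; everything else here is an assumption made for illustration: average pooling over the 2D token grid stands in for the compression operator, attention is single-head with no learned projections, and the names `compress_tokens`, `kv_compressed_attention`, and `ratio` are hypothetical and do not come from the paper or its code.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def compress_tokens(x, height, width, ratio):
    """Average-pool a (height*width, dim) token grid by `ratio` per spatial axis.

    Assumption: average pooling is an illustrative stand-in; the abstract
    does not specify which compression operator is used.
    """
    dim = x.shape[-1]
    assert x.shape[0] == height * width
    assert height % ratio == 0 and width % ratio == 0
    # Split each spatial axis into (blocks, within-block) and average within blocks.
    grid = x.reshape(height // ratio, ratio, width // ratio, ratio, dim)
    return grid.mean(axis=(1, 3)).reshape(-1, dim)


def kv_compressed_attention(q, k, v, height, width, ratio=2):
    """Single-head attention where only keys and values are compressed."""
    k_c = compress_tokens(k, height, width, ratio)
    v_c = compress_tokens(v, height, width, ratio)
    # Score matrix is (N, N / ratio**2) instead of (N, N).
    scores = q @ k_c.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v_c, scores.shape


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    h, w, d = 8, 8, 16
    q, k, v = (rng.standard_normal((h * w, d)) for _ in range(3))
    out, score_shape = kv_compressed_attention(q, k, v, h, w, ratio=2)
    print(out.shape, score_shape)
```

Because queries are left at full length, the output keeps one vector per token, while the attention score matrix shrinks by a factor of `ratio**2`. That reduction is what makes very long token sequences, such as those of high-resolution images, cheaper to attend over.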