PixArt-Σ: Weak-to-Strong Training of Diffusion Transformer for 4K Text-to-Image Generation

March 7, 2024
Authors: Junsong Chen, Chongjian Ge, Enze Xie, Yue Wu, Lewei Yao, Xiaozhe Ren, Zhongdao Wang, Ping Luo, Huchuan Lu, Zhenguo Li
cs.AI

Abstract

In this paper, we introduce PixArt-Σ, a Diffusion Transformer (DiT) model capable of directly generating images at 4K resolution. PixArt-Σ represents a significant advancement over its predecessor, PixArt-α, offering images of markedly higher fidelity and improved alignment with text prompts. A key feature of PixArt-Σ is its training efficiency. Leveraging the foundational pre-training of PixArt-α, it evolves from the 'weaker' baseline to a 'stronger' model by incorporating higher-quality data, a process we term "weak-to-strong training". The advancements in PixArt-Σ are twofold: (1) High-Quality Training Data: PixArt-Σ incorporates superior-quality image data, paired with more precise and detailed image captions. (2) Efficient Token Compression: we propose a novel attention module within the DiT framework that compresses both keys and values, significantly improving efficiency and facilitating ultra-high-resolution image generation. Thanks to these improvements, PixArt-Σ achieves superior image quality and user prompt adherence with a significantly smaller model size (0.6B parameters) than existing text-to-image diffusion models such as SDXL (2.6B parameters) and SD Cascade (5.1B parameters). Moreover, PixArt-Σ's capability to generate 4K images supports the creation of high-resolution posters and wallpapers, efficiently bolstering the production of high-quality visual content in industries such as film and gaming.
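
The idea of compressing keys and values inside attention can be illustrated with a minimal NumPy sketch. The abstract does not specify the compression operator, so the average pooling over non-overlapping r x r windows used here is an assumption made purely for illustration, as are the function names `pool_tokens` and `kv_compressed_attention`, the single-head setup, and the toy sizes. It is not the authors' implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def pool_tokens(x, h, w, r):
    """Average-pool an (h*w, d) row-major token grid over r x r windows.

    Assumption: average pooling stands in for whatever compression
    operator the paper actually uses. h and w must be divisible by r.
    """
    d = x.shape[-1]
    grid = x.reshape(h // r, r, w // r, r, d)
    pooled = grid.mean(axis=(1, 3))  # shape (h//r, w//r, d)
    return pooled.reshape((h // r) * (w // r), d)

def kv_compressed_attention(q, k, v, h, w, r):
    """Single-head attention where only keys and values are compressed.

    Queries keep all h*w tokens, so the output has full resolution,
    while the score matrix shrinks from (N, N) to (N, N / r**2).
    """
    k_c = pool_tokens(k, h, w, r)
    v_c = pool_tokens(v, h, w, r)
    scores = q @ k_c.T / np.sqrt(q.shape[-1])
    return softmax(scores, axis=-1) @ v_c

# Toy example: an 8 x 8 token grid with 16 channels, compression ratio 2.
rng = np.random.default_rng(0)
h, w, d, r = 8, 8, 16, 2
q, k, v = (rng.standard_normal((h * w, d)) for _ in range(3))
out = kv_compressed_attention(q, k, v, h, w, r)
```

In this sketch the output keeps one row per query token, while the number of key/value tokens drops by a factor of r², which is where the memory and compute savings at high resolution would come from. Setting `r = 1` recovers ordinary full attention.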