PIXART-δ: Fast and Controllable Image Generation with Latent Consistency Models
January 10, 2024
Authors: Junsong Chen, Yue Wu, Simian Luo, Enze Xie, Sayak Paul, Ping Luo, Hang Zhao, Zhenguo Li
cs.AI
Abstract
This technical report introduces PIXART-δ, a text-to-image synthesis
framework that integrates the Latent Consistency Model (LCM) and ControlNet
into the advanced PIXART-α model. PIXART-α is recognized for its
ability to generate high-quality images at 1024px resolution through a
remarkably efficient training process. Integrating LCM into
PIXART-δ significantly accelerates inference, enabling the
production of high-quality images in just 2-4 steps. Notably, PIXART-δ
generates 1024x1024 pixel images in 0.5 seconds,
a 7x speedup over PIXART-α. Additionally,
PIXART-δ is designed to be efficiently trainable on 32GB V100 GPUs
within a single day. With its 8-bit inference capability (von Platen et al.,
2023), PIXART-δ can synthesize 1024px images within 8GB of GPU memory,
greatly enhancing its usability and accessibility. Furthermore,
incorporating a ControlNet-like module enables fine-grained control over
text-to-image diffusion models. We introduce a novel ControlNet-Transformer
architecture, specifically tailored for Transformers, that achieves explicit
controllability alongside high-quality image generation. As a state-of-the-art,
open-source image generation model, PIXART-δ offers a promising
alternative to the Stable Diffusion family of models, contributing
significantly to text-to-image synthesis.
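
The 2-4 step generation mentioned in the abstract rests on consistency-model sampling: a single call maps a noisy sample directly to a clean estimate, which is then re-noised to a smaller timestep and refined. The sketch below is a toy illustration of that loop only, not the PIXART-δ implementation. It assumes 1-D Gaussian data N(mu, s0^2), for which the consistency function has a closed form, and a hypothetical cosine noise schedule. The names `alpha_sigma`, `toy_consistency_fn`, and `few_step_sample` are invented for this example; in a real LCM a trained network replaces `toy_consistency_fn`.

```python
import math
import random
import statistics


def alpha_sigma(t):
    # Hypothetical variance-preserving schedule on t in [0, 1] (an assumption):
    # alpha(0) = 1, sigma(0) = 0 is clean data; alpha(1) ~ 0, sigma(1) = 1 is pure noise.
    return math.cos(0.5 * math.pi * t), math.sin(0.5 * math.pi * t)


def toy_consistency_fn(x_t, t, mu, s0):
    # Closed-form consistency function for toy data N(mu, s0^2): maps a point
    # on the probability-flow ODE trajectory back to its t = 0 origin.
    # A trained network would play this role in a real LCM.
    alpha, sigma = alpha_sigma(t)
    std_t = math.sqrt(alpha**2 * s0**2 + sigma**2)
    return mu + s0 * (x_t - alpha * mu) / std_t


def few_step_sample(timesteps, mu, s0, rng):
    # Multistep consistency sampling. `timesteps` must be non-empty,
    # decreasing, and start at 1.0 so the initial draw is pure noise.
    x = rng.gauss(0.0, 1.0)
    for i, t in enumerate(timesteps):
        # One function call jumps straight to a clean estimate.
        x0 = toy_consistency_fn(x, t, mu, s0)
        if i + 1 < len(timesteps):
            # Re-noise the estimate to the next, smaller timestep.
            a_next, s_next = alpha_sigma(timesteps[i + 1])
            x = a_next * x0 + s_next * rng.gauss(0.0, 1.0)
    return x0


rng = random.Random(0)
samples = [few_step_sample([1.0, 0.75, 0.5, 0.25], 2.0, 0.5, rng) for _ in range(20000)]
# Sample mean and standard deviation should land near mu = 2.0 and s0 = 0.5.
print(statistics.mean(samples), statistics.stdev(samples))
```

Four timesteps correspond to four function evaluations, which is where the speedup over a many-step diffusion sampler comes from. In this toy setting the consistency function is exact, so even one step yields correctly distributed samples. With a learned model, extra steps trade speed for quality.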