PanGu-Draw: Advancing Resource-Efficient Text-to-Image Synthesis with Time-Decoupled Training and Reusable Coop-Diffusion
December 27, 2023
Authors: Guansong Lu, Yuanfan Guo, Jianhua Han, Minzhe Niu, Yihan Zeng, Songcen Xu, Zeyi Huang, Zhao Zhong, Wei Zhang, Hang Xu
cs.AI
Abstract
Current large-scale diffusion models represent a giant leap forward in
conditional image synthesis, capable of interpreting diverse cues like text,
human poses, and edges. However, their reliance on substantial computational
resources and extensive data collection remains a bottleneck. On the other
hand, the integration of existing diffusion models, each specialized for
different controls and operating in unique latent spaces, poses a challenge due
to incompatible image resolutions and latent space embedding structures,
hindering their joint use. Addressing these constraints, we present
"PanGu-Draw", a novel latent diffusion model designed for resource-efficient
text-to-image synthesis that adeptly accommodates multiple control signals. We
first propose a resource-efficient Time-Decoupling Training Strategy, which
splits the monolithic text-to-image model into structure and texture
generators. Each generator is trained using a regimen that maximizes data
utilization and computational efficiency, cutting data preparation by 48% and
reducing training resources by 51%. Secondly, we introduce "Coop-Diffusion", an
algorithm that enables the cooperative use of various pre-trained diffusion
models with different latent spaces and predefined resolutions within a unified
denoising process. This allows for multi-control image synthesis at arbitrary
resolutions without the need for additional data or retraining. Empirical
validations of PanGu-Draw show its exceptional prowess in text-to-image and
multi-control image generation, suggesting a promising direction for future
model training efficiencies and generation versatility. The largest 5B T2I
PanGu-Draw model is released on the Ascend platform. Project page:
https://pangu-draw.github.io
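The Time-Decoupling Training Strategy described above can be illustrated with a minimal sampling sketch: the structure generator handles the early, high-noise timesteps and the texture generator the late, low-noise ones. All function names, the step interface, and the split point below are illustrative assumptions, not the paper's actual API.

```python
# Hedged sketch of time-decoupled sampling (assumed interface, not
# PanGu-Draw's real code): the reverse-diffusion chain runs from
# t = total_steps - 1 down to 0, switching denoisers at `split`.

def sample_time_decoupled(x_T, structure_denoiser, texture_denoiser,
                          total_steps=1000, split=500):
    """Run the denoising chain, handing high-noise steps (t >= split)
    to the structure generator and low-noise steps to the texture
    generator. Each denoiser maps (x_t, t) -> x_{t-1}."""
    x = x_T
    for t in range(total_steps - 1, -1, -1):
        denoiser = structure_denoiser if t >= split else texture_denoiser
        x = denoiser(x, t)  # one reverse-diffusion step
    return x
```

Because each generator is only ever trained and queried on its own timestep interval, each can use a shorter training schedule and a smaller slice of the data, which is the source of the reported resource savings.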
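The core idea behind Coop-Diffusion, fusing models that live in incompatible latent spaces, can be sketched as mapping each model's prediction into the shared pixel space, combining there, and re-encoding. The decoder/encoder names and the simple weighted fusion below are assumptions for illustration; the paper's actual bridging algorithm may differ.

```python
# Hedged sketch of cross-latent-space fusion (illustrative only): two
# pre-trained models A and B each predict a clean image in their own
# latent space; pixel space acts as the common ground between them.

def coop_fuse(pred_a, pred_b, decode_a, encode_a, decode_b, w=0.5):
    """Fuse model A's and model B's predictions in pixel space and map
    the result back into model A's latent space, so a single unified
    denoising process can continue there."""
    pix_a = decode_a(pred_a)                # A's latent -> pixel space
    pix_b = decode_b(pred_b)                # B's latent -> pixel space
    fused = w * pix_a + (1.0 - w) * pix_b   # simple weighted blend (assumption)
    return encode_a(fused)                  # back into A's latent space
```

Since both predictions are reconciled at every denoising step rather than once at the end, the two models can jointly steer one generation, e.g. combining a text-conditioned model with an edge- or pose-conditioned one, without retraining either.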