PanGu-Draw: Advancing Resource-Efficient Text-to-Image Synthesis with Time-Decoupled Training and Reusable Coop-Diffusion
December 27, 2023
Authors: Guansong Lu, Yuanfan Guo, Jianhua Han, Minzhe Niu, Yihan Zeng, Songcen Xu, Zeyi Huang, Zhao Zhong, Wei Zhang, Hang Xu
cs.AI
Abstract
Current large-scale diffusion models represent a giant leap forward in
conditional image synthesis, capable of interpreting diverse cues like text,
human poses, and edges. However, their reliance on substantial computational
resources and extensive data collection remains a bottleneck. On the other
hand, the integration of existing diffusion models, each specialized for
different controls and operating in unique latent spaces, poses a challenge due
to incompatible image resolutions and latent space embedding structures,
hindering their joint use. Addressing these constraints, we present
"PanGu-Draw", a novel latent diffusion model designed for resource-efficient
text-to-image synthesis that adeptly accommodates multiple control signals. We
first propose a resource-efficient Time-Decoupling Training Strategy, which
splits the monolithic text-to-image model into structure and texture
generators. Each generator is trained using a regimen that maximizes data
utilization and computational efficiency, cutting data preparation by 48% and
reducing training resources by 51%. Secondly, we introduce "Coop-Diffusion", an
algorithm that enables the cooperative use of various pre-trained diffusion
models with different latent spaces and predefined resolutions within a unified
denoising process. This allows for multi-control image synthesis at arbitrary
resolutions without the necessity for additional data or retraining. Empirical
validations of PanGu-Draw show its exceptional prowess in text-to-image and
multi-control image generation, suggesting a promising direction for future
model training efficiencies and generation versatility. The largest 5B T2I
PanGu-Draw model is released on the Ascend platform. Project page:
https://pangu-draw.github.io
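The Time-Decoupling Training Strategy described above can be illustrated with a minimal sampling sketch. The denoisers below are toy stand-ins, not the paper's actual networks, and the split point and update rules are assumptions for illustration: early, high-noise timesteps go to a structure generator that lays out global composition, while later, low-noise timesteps go to a texture generator that refines detail.

```python
import numpy as np

def structure_denoiser(x, t):
    # Hypothetical stand-in for PanGu-Draw's structure generator,
    # which handles the early, high-noise timesteps. Toy update only.
    return x * (t / (t + 1.0))

def texture_denoiser(x, t):
    # Hypothetical stand-in for the texture generator, which refines
    # fine detail over the later, low-noise timesteps. Toy update only.
    return x * (1.0 - 0.5 / (t + 1.0))

def time_decoupled_sampling(x_T, total_steps=50, split=25):
    """Run one denoising trajectory, handing off between the two
    specialized generators at an assumed split timestep."""
    x = x_T
    for t in range(total_steps, 0, -1):
        denoiser = structure_denoiser if t > split else texture_denoiser
        x = denoiser(x, t)
    return x

x0 = time_decoupled_sampling(np.ones((4, 4)))
```

Because each generator only ever sees its own timestep range, each can be trained on data and at resolutions suited to its role, which is the source of the reported data and compute savings.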
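The core obstacle Coop-Diffusion addresses is that each pre-trained model predicts in its own latent space. A minimal sketch of one cooperative step follows; the encoder/decoder scalings, the pixel-space bridge, and the averaging fusion are all assumptions for illustration, not the paper's actual algorithm.

```python
import numpy as np

# Hypothetical encoder/decoder pairs for two pre-trained models whose
# latent spaces are incompatible (scalings are illustrative assumptions).
def decode_a(z):  # model A: latent -> shared pixel space
    return z * 2.0

def encode_a(x):  # pixel space -> model A's latent
    return x / 2.0

def decode_b(z):  # model B: a differently scaled latent space
    return z * 4.0

def encode_b(x):
    return x / 4.0

def fused_step(z_a, denoise_a, denoise_b, t):
    """One cooperative denoising step: bridge model B's prediction into
    model A's latent space via the shared pixel space, then fuse by
    averaging (the fusion rule here is an assumption)."""
    pred_a = denoise_a(z_a, t)                            # A's prediction
    z_b = encode_b(decode_a(z_a))                         # A -> pixels -> B
    pred_b_in_a = encode_a(decode_b(denoise_b(z_b, t)))   # B's prediction, mapped back
    return 0.5 * (pred_a + pred_b_in_a)

# Usage with toy denoisers that both shrink their input by 10%:
z0 = fused_step(np.ones(3), lambda z, t: z * 0.9, lambda z, t: z * 0.9, t=10)
```

Because the bridge passes through pixel space, the same mechanism can in principle reconcile models with different native resolutions by resampling there, which is what allows multi-control synthesis at arbitrary resolutions without retraining.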