CogView3: Finer and Faster Text-to-Image Generation via Relay Diffusion
March 8, 2024
Authors: Wendi Zheng, Jiayan Teng, Zhuoyi Yang, Weihan Wang, Jidong Chen, Xiaotao Gu, Yuxiao Dong, Ming Ding, Jie Tang
cs.AI
Abstract
Recent advancements in text-to-image generative systems have been largely
driven by diffusion models. However, single-stage text-to-image diffusion
models still face challenges in computational efficiency and the
refinement of image details. To address these issues, we propose CogView3, an
innovative cascaded framework that enhances the performance of text-to-image
diffusion. CogView3 is the first model implementing relay diffusion in the
realm of text-to-image generation, executing the task by first creating
low-resolution images and subsequently applying relay-based super-resolution.
This methodology not only results in competitive text-to-image outputs but also
greatly reduces both training and inference costs. Our experimental results
demonstrate that CogView3 outperforms SDXL, the current state-of-the-art
open-source text-to-image diffusion model, by 77.0% in human evaluations,
while requiring only about 1/2 of the inference time. The distilled variant of
CogView3 achieves comparable performance while using only 1/10 of the
inference time required by SDXL.
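
The two-stage flow described above (a low-resolution image first, then relay-based super-resolution) can be illustrated with a toy sketch. This is not the authors' implementation. The function names, the nearest-neighbour upsampling, the noise level, and the placeholder "denoiser" are all hypothetical, and the sketch assumes that the relay stage starts from a noised copy of the upsampled image rather than from pure noise, a detail the abstract does not spell out.

```python
import random


def base_stage(prompt, size=4, seed=0):
    """Hypothetical stand-in for the low-resolution diffusion model.

    The prompt is ignored here; a real model would condition on it.
    """
    rng = random.Random(seed)
    return [[rng.random() for _ in range(size)] for _ in range(size)]


def upsample(image, factor=2):
    """Nearest-neighbour upsampling: each pixel becomes a factor x factor block."""
    return [[px for px in row for _ in range(factor)]
            for row in image for _ in range(factor)]


def relay_super_resolution(low_res, factor=2, noise_level=0.5, steps=3, seed=1):
    """Assumed relay stage: begin from the noised, upsampled image."""
    rng = random.Random(seed)
    target = upsample(low_res, factor)
    # Add Gaussian noise to the upsampled image instead of sampling pure noise.
    x = [[px + noise_level * rng.gauss(0.0, 1.0) for px in row] for row in target]
    for _ in range(steps):
        # Placeholder denoiser: move halfway toward the upsampled image.
        x = [[0.5 * (a + b) for a, b in zip(rx, rt)] for rx, rt in zip(x, target)]
    return x


def generate(prompt):
    """Run the toy cascade: low-resolution stage, then relay super-resolution."""
    low_res = base_stage(prompt)
    return relay_super_resolution(low_res)
```

Because the second stage starts from an image that already carries the layout of the first stage's output, it needs only a few refinement steps in this toy setting, which is the intuition behind the reduced inference cost claimed in the abstract.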