CogView3: Finer and Faster Text-to-Image Generation via Relay Diffusion
March 8, 2024
Authors: Wendi Zheng, Jiayan Teng, Zhuoyi Yang, Weihan Wang, Jidong Chen, Xiaotao Gu, Yuxiao Dong, Ming Ding, Jie Tang
cs.AI
Abstract
Recent advancements in text-to-image generative systems have been largely
driven by diffusion models. However, single-stage text-to-image diffusion
models still face challenges in terms of computational efficiency and the
refinement of image details. To tackle the issue, we propose CogView3, an
innovative cascaded framework that enhances the performance of text-to-image
diffusion. CogView3 is the first model implementing relay diffusion in the
realm of text-to-image generation, executing the task by first creating
low-resolution images and subsequently applying relay-based super-resolution.
This methodology not only results in competitive text-to-image outputs but also
greatly reduces both training and inference costs. Our experimental results
demonstrate that CogView3 outperforms SDXL, the current state-of-the-art
open-source text-to-image diffusion model, by 77.0% in human evaluations, all
while requiring only about 1/2 of the inference time. The distilled variant of
CogView3 achieves comparable performance while using only 1/10 of the
inference time of SDXL.