SnapFusion: Text-to-Image Diffusion Model on Mobile Devices within Two Seconds
June 1, 2023
Authors: Yanyu Li, Huan Wang, Qing Jin, Ju Hu, Pavlo Chemerys, Yun Fu, Yanzhi Wang, Sergey Tulyakov, Jian Ren
cs.AI
Abstract
Text-to-image diffusion models can create stunning images from natural
language descriptions that rival the work of professional artists and
photographers. However, these models are large, with complex network
architectures and tens of denoising iterations, making them computationally
expensive and slow to run. As a result, high-end GPUs and cloud-based inference
are required to run diffusion models at scale. This is costly and has privacy
implications, especially when user data is sent to a third party. To overcome
these challenges, we present a generic approach that, for the first time,
unlocks running text-to-image diffusion models on mobile devices in less than
2 seconds. We achieve this by introducing an efficient network architecture and
improving step distillation. Specifically, we propose an efficient UNet by
identifying redundancy in the original model and reducing the computation
of the image decoder via data distillation. Further, we enhance the step
distillation by exploring training strategies and introducing regularization
from classifier-free guidance. Our extensive experiments on MS-COCO show that
our model with 8 denoising steps achieves better FID and CLIP scores than
Stable Diffusion v1.5 with 50 steps. Our work democratizes content creation
by bringing powerful text-to-image diffusion models into the hands of users.
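
To make the abstract's last technical point concrete, below is a minimal
PyTorch-style sketch of step distillation regularized by classifier-free
guidance (CFG). It is an illustration under assumed interfaces, not the
paper's implementation: the `unet` callables, `null_emb` (the unconditional
text embedding), and the guidance scale `w` are hypothetical placeholders.

    import torch

    def cfg_predict(unet, x_t, t, text_emb, null_emb, w):
        # Classifier-free guidance: blend the conditional and unconditional
        # noise predictions using guidance scale w.
        # (unet, null_emb, and w are assumed placeholders, not the paper's API.)
        eps_cond = unet(x_t, t, text_emb)
        eps_uncond = unet(x_t, t, null_emb)
        return eps_uncond + w * (eps_cond - eps_uncond)

    def step_distill_loss(student_unet, teacher_unet,
                          x_t, t, text_emb, null_emb, w):
        # The teacher's CFG-guided prediction is the regression target, so
        # guidance behavior is baked into the few-step student.
        with torch.no_grad():
            target = cfg_predict(teacher_unet, x_t, t, text_emb, null_emb, w)
        pred = cfg_predict(student_unet, x_t, t, text_emb, null_emb, w)
        return torch.mean((pred - target) ** 2)

The intuition is that distilling against the guided teacher output, rather
than the raw conditional prediction, helps the few-step student retain the
text alignment that guidance normally provides over many steps, consistent
with the CLIP-score results reported above.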