SnapFusion: Text-to-Image Diffusion Model on Mobile Devices within Two Seconds
June 1, 2023
Authors: Yanyu Li, Huan Wang, Qing Jin, Ju Hu, Pavlo Chemerys, Yun Fu, Yanzhi Wang, Sergey Tulyakov, Jian Ren
cs.AI
Abstract
Text-to-image diffusion models can create stunning images from natural
language descriptions that rival the work of professional artists and
photographers. However, these models are large, with complex network
architectures and tens of denoising iterations, making them computationally
expensive and slow to run. As a result, high-end GPUs and cloud-based inference
are required to run diffusion models at scale. This is costly and has privacy
implications, especially when user data is sent to a third party. To overcome
these challenges, we present a generic approach that, for the first time,
unlocks running text-to-image diffusion models on mobile devices in less than
2 seconds. We achieve this by introducing an efficient network architecture and
improving step distillation. Specifically, we propose an efficient UNet by
identifying redundancy in the original model and reducing the computation
of the image decoder via data distillation. Further, we enhance the step
distillation by exploring training strategies and introducing regularization
from classifier-free guidance. Our extensive experiments on MS-COCO show that
our model with 8 denoising steps achieves better FID and CLIP scores than
Stable Diffusion v1.5 with 50 steps. Our work democratizes content creation
by bringing powerful text-to-image diffusion models into the hands of users.
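
To make the abstract's last technical point concrete, below is a minimal
PyTorch-style sketch of step distillation regularized by classifier-free
guidance (CFG). It is an illustration under assumed interfaces, not the
paper's implementation: the `unet` callables, `null_emb` (the unconditional
text embedding), and the guidance scale `w` are hypothetical placeholders.

    import torch

    def cfg_predict(unet, x_t, t, text_emb, null_emb, w):
        # Classifier-free guidance: blend the conditional and unconditional
        # noise predictions using guidance scale w.
        # (unet, null_emb, and w are assumed placeholders, not the paper's API.)
        eps_cond = unet(x_t, t, text_emb)
        eps_uncond = unet(x_t, t, null_emb)
        return eps_uncond + w * (eps_cond - eps_uncond)

    def step_distill_loss(student_unet, teacher_unet,
                          x_t, t, text_emb, null_emb, w):
        # The teacher's CFG-guided prediction is the regression target, so
        # guidance behavior is baked into the few-step student.
        with torch.no_grad():
            target = cfg_predict(teacher_unet, x_t, t, text_emb, null_emb, w)
        pred = cfg_predict(student_unet, x_t, t, text_emb, null_emb, w)
        return torch.mean((pred - target) ** 2)

The intuition is that distilling against the guided teacher output, rather
than the raw conditional prediction, helps the few-step student retain the
text alignment that guidance normally provides over many steps, consistent
with the CLIP-score results reported above.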