
SnapFusion: Text-to-Image Diffusion Model on Mobile Devices within Two Seconds

June 1, 2023
Authors: Yanyu Li, Huan Wang, Qing Jin, Ju Hu, Pavlo Chemerys, Yun Fu, Yanzhi Wang, Sergey Tulyakov, Jian Ren
cs.AI

Abstract

Text-to-image diffusion models can create stunning images from natural language descriptions that rival the work of professional artists and photographers. However, these models are large, with complex network architectures and tens of denoising iterations, making them computationally expensive and slow to run. As a result, high-end GPUs and cloud-based inference are required to run diffusion models at scale. This is costly and has privacy implications, especially when user data is sent to a third party. To overcome these challenges, we present a generic approach that, for the first time, unlocks running text-to-image diffusion models on mobile devices in less than 2 seconds. We achieve this by introducing an efficient network architecture and improving step distillation. Specifically, we propose an efficient UNet by identifying the redundancy of the original model and reducing the computation of the image decoder via data distillation. Further, we enhance the step distillation by exploring training strategies and introducing regularization from classifier-free guidance. Our extensive experiments on MS-COCO show that our model with 8 denoising steps achieves better FID and CLIP scores than Stable Diffusion v1.5 with 50 steps. Our work democratizes content creation by bringing powerful text-to-image diffusion models into the hands of users.
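The abstract names two ingredients that a short sketch may help make concrete: step distillation (a student trained to match a stronger teacher at far fewer denoising steps) and regularization from classifier-free guidance (CFG). Below is a minimal PyTorch sketch, assuming `teacher` and `student` are frozen/trainable noise-prediction UNets called as `model(x_t, t, cond)`; the function names, signatures, and default guidance scale are illustrative assumptions, not the paper's actual training code:

```python
import torch

def cfg(eps_uncond, eps_cond, w):
    # Classifier-free guidance: push the conditional noise prediction
    # away from the unconditional one by guidance scale w.
    return eps_uncond + w * (eps_cond - eps_uncond)

def cfg_aware_distill_loss(student, teacher, x_t, t, text_emb, null_emb, w=7.5):
    # Hypothetical sketch of CFG-aware step distillation: the teacher's
    # *guided* noise prediction supervises the student's guided prediction,
    # so the guidance scale is accounted for during distillation rather
    # than only at sampling time.
    with torch.no_grad():  # teacher is frozen
        teacher_eps = cfg(teacher(x_t, t, null_emb),
                          teacher(x_t, t, text_emb), w)
    student_eps = cfg(student(x_t, t, null_emb),
                      student(x_t, t, text_emb), w)
    return torch.mean((student_eps - teacher_eps) ** 2)
```

In plain step distillation the loss compares raw noise predictions; folding the guided predictions into the loss, as sketched here, is one way to realize the "regularization from classifier-free guidance" the abstract describes, so that image quality at a fixed guidance strength survives the reduction from 50 denoising steps to 8.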