SnapFusion: Text-to-Image Diffusion Model on Mobile Devices within Two Seconds

Abstract

Text-to-image diffusion models can create stunning images from natural language descriptions that rival the work of professional artists and photographers. However, these models are large, with complex network architectures and tens of denoising iterations, making them computationally expensive and slow to run. As a result, high-end GPUs and cloud-based inference are required to run diffusion models at scale. This is costly and has privacy implications, especially when user data is sent to a third party. To overcome these challenges, we present a generic approach that, for the first time, unlocks running text-to-image diffusion models on mobile devices in less than 2 seconds. We achieve this by introducing an efficient network architecture and improving step distillation. Specifically, we propose an efficient UNet by identifying the redundancy of the original model, and we reduce the computation of the image decoder via data distillation. Further, we enhance step distillation by exploring training strategies and introducing regularization from classifier-free guidance. Our extensive experiments on MS-COCO show that our model with 8 denoising steps achieves better FID and CLIP scores than Stable Diffusion v1.5 with 50 steps. Our work democratizes content creation by bringing powerful text-to-image diffusion models into the hands of users.
