SnapFusion: Text-to-Image Diffusion Model on Mobile Devices within Two Seconds

Abstract

Text-to-image diffusion models can create stunning images from natural language descriptions that rival the work of professional artists and photographers. However, these models are large, with complex network architectures and tens of denoising iterations, making them computationally expensive and slow to run. As a result, high-end GPUs and cloud-based inference are required to run diffusion models at scale. This is costly and has privacy implications, especially when user data is sent to a third party. To overcome these challenges, we present a generic approach that, for the first time, unlocks running text-to-image diffusion models on mobile devices in less than 2 seconds. We achieve this by introducing an efficient network architecture and improving step distillation. Specifically, we propose an efficient UNet by identifying the redundancy of the original model and reducing the computation of the image decoder via data distillation. Further, we enhance the step distillation by exploring training strategies and introducing regularization from classifier-free guidance. Our extensive experiments on MS-COCO show that our model with 8 denoising steps achieves better FID and CLIP scores than Stable Diffusion v1.5 with 50 steps. Our work democratizes content creation by bringing powerful text-to-image diffusion models into the hands of users.
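The abstract mentions classifier-free guidance without spelling it out. As a rough illustration only, the standard guidance rule combines an unconditional and a text-conditioned noise prediction as eps = eps_uncond + w * (eps_cond - eps_uncond). The sketch below shows that combination on toy numbers; the function name, the list-based representation, and the values are assumptions made for illustration, and it does not reproduce the regularization or distillation loss used in the paper.

```python
from typing import List


def classifier_free_guidance(
    eps_uncond: List[float],
    eps_cond: List[float],
    guidance_scale: float,
) -> List[float]:
    """Combine unconditional and text-conditioned noise predictions.

    Standard classifier-free guidance: start from the unconditional
    prediction and move along the direction toward the conditional one.
    (Hypothetical helper for illustration, not the paper's code.)
    """
    if len(eps_uncond) != len(eps_cond):
        raise ValueError("predictions must have the same length")
    return [
        u + guidance_scale * (c - u)
        for u, c in zip(eps_uncond, eps_cond)
    ]


# Toy values standing in for the two noise predictions (made-up numbers).
eps_uncond = [0.0, 1.0, -2.0]
eps_cond = [1.0, 1.0, 0.0]

guided = classifier_free_guidance(eps_uncond, eps_cond, guidance_scale=2.0)
print(guided)
```

With a guidance scale of 1.0 the result equals the conditional prediction, and with 0.0 it equals the unconditional one; larger scales push the sample further toward the text condition.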


