SnapGen: Taming High-Resolution Text-to-Image Models for Mobile Devices with Efficient Architectures and Training

Abstract

Existing text-to-image (T2I) diffusion models face several limitations, including large model sizes, slow runtime, and low-quality generation on mobile devices. This paper aims to address all of these challenges by developing an extremely small and fast T2I model that generates high-resolution, high-quality images on mobile platforms. We propose several techniques to achieve this goal. First, we systematically examine the design choices of the network architecture to reduce model parameters and latency while ensuring high-quality generation. Second, to further improve generation quality, we employ cross-architecture knowledge distillation from a much larger model, using a multi-level approach to guide the training of our model from scratch. Third, we enable few-step generation by integrating adversarial guidance with knowledge distillation. For the first time, our model, SnapGen, demonstrates the generation of 1024x1024 px images on a mobile device in around 1.4 seconds. On ImageNet-1K, our model, with only 372M parameters, achieves an FID of 2.06 for 256x256 px generation. On T2I benchmarks (i.e., GenEval and DPG-Bench), our model, with merely 379M parameters, surpasses large-scale models with billions of parameters while being significantly smaller (e.g., 7x smaller than SDXL, 14x smaller than IF-XL).
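
The multi-level distillation idea mentioned in the abstract can be illustrated with a toy loss. The following is a minimal sketch, not SnapGen's actual objective: it assumes mean-squared-error terms at the output level and at each matched feature level, plus a linear projection to bridge the differing feature widths of a cross-architecture teacher and student. All function names, the weights, and the choice of MSE are hypothetical.

```python
from typing import List, Sequence

Vector = Sequence[float]


def mse(a: Vector, b: Vector) -> float:
    """Mean squared error between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def project(features: Vector, weights: Sequence[Vector]) -> List[float]:
    """Linear map from the student's feature width to the teacher's.

    Hypothetical helper: each row of `weights` yields one teacher-sized entry.
    """
    return [sum(w * f for w, f in zip(row, features)) for row in weights]


def multi_level_distillation_loss(
    student_out: Vector,
    teacher_out: Vector,
    target: Vector,
    student_feats: Sequence[Vector],
    teacher_feats: Sequence[Vector],
    projections: Sequence[Sequence[Vector]],
    output_weight: float = 1.0,   # assumed weighting, not from the paper
    feature_weight: float = 1.0,  # assumed weighting, not from the paper
) -> float:
    """Task loss plus output-level and feature-level distillation terms."""
    task = mse(student_out, target)            # ordinary training signal
    output_kd = mse(student_out, teacher_out)  # match the teacher's prediction
    feature_kd = sum(                          # match projected intermediate features
        mse(project(s, p), t)
        for s, t, p in zip(student_feats, teacher_feats, projections)
    )
    return task + output_weight * output_kd + feature_weight * feature_kd


# Toy usage: a 2-wide student feature is projected to a 3-wide teacher feature.
loss = multi_level_distillation_loss(
    student_out=[1.0, 2.0],
    teacher_out=[1.0, 4.0],
    target=[1.0, 2.0],
    student_feats=[[1.0, 1.0]],
    teacher_feats=[[2.0, 0.0, 2.0]],
    projections=[[[1.0, 1.0], [0.0, 0.0], [1.0, 0.0]]],
    feature_weight=3.0,
)
```

In a real training loop the projections would be learned jointly with the student, and the per-term weights would need tuning. Both are left as plain inputs here to keep the sketch self-contained.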
