MLCM: Multistep Consistency Distillation of Latent Diffusion Model

Abstract

Distilling large latent diffusion models (LDMs) into models that are fast to sample from is attracting growing research interest. However, most existing methods face a dilemma: they either (i) depend on multiple separately distilled models for different sampling budgets, or (ii) sacrifice generation quality when using a limited (e.g., 2-4) or moderate (e.g., 5-8) number of sampling steps. To address this dilemma, we extend the recent multistep consistency distillation (MCD) strategy to representative LDMs, establishing the Multistep Latent Consistency Models (MLCMs) approach for low-cost, high-quality image synthesis. Because it is trained with MCD, an MLCM serves as a single unified model across different numbers of sampling steps. We further augment MCD with a progressive training strategy that strengthens inter-segment consistency and thereby improves the quality of few-step generation. We use states from the sampling trajectories of the teacher model as training data for MLCMs, which relaxes the need for high-quality training datasets and bridges the gap between training and inference of the distilled model. MLCM is also compatible with preference learning strategies, which further improve visual quality and aesthetic appeal. Empirically, MLCM generates high-quality, appealing images with only 2-8 sampling steps. On the MSCOCO-2017 5K benchmark, MLCM distilled from SDXL achieves a CLIP Score of 33.30, an Aesthetic Score of 6.19, and an Image Reward of 1.20 with only 4 steps, substantially surpassing 4-step LCM [23], 8-step SDXL-Lightning [17], and 8-step HyperSD [33]. We also demonstrate the versatility of MLCMs in applications including controllable generation, image style transfer, and Chinese-to-image generation.
