Simplifying, Stabilizing and Scaling Continuous-Time Consistency Models

Abstract

Consistency models (CMs) are a powerful class of diffusion-based generative models optimized for fast sampling. Most existing CMs are trained using discretized timesteps, which introduce additional hyperparameters and are prone to discretization errors. While continuous-time formulations can mitigate these issues, their success has been limited by training instability. To address this, we propose a simplified theoretical framework that unifies previous parameterizations of diffusion models and CMs, identifying the root causes of instability. Based on this analysis, we introduce key improvements in diffusion process parameterization, network architecture, and training objectives. These changes enable us to train continuous-time CMs at an unprecedented scale, reaching 1.5B parameters on ImageNet 512x512. Our proposed training algorithm, using only two sampling steps, achieves FID scores of 2.06 on CIFAR-10, 1.48 on ImageNet 64x64, and 1.88 on ImageNet 512x512, narrowing the gap in FID scores with the best existing diffusion models to within 10%.
