Analyzing and Improving the Training Dynamics of Diffusion Models

Abstract

Diffusion models currently dominate the field of data-driven image synthesis with their unparalleled scaling to large datasets. In this paper, we identify and rectify several causes for uneven and ineffective training in the popular ADM diffusion model architecture, without altering its high-level structure. Observing uncontrolled magnitude changes and imbalances in both the network activations and weights over the course of training, we redesign the network layers to preserve activation, weight, and update magnitudes on expectation. We find that systematic application of this philosophy eliminates the observed drifts and imbalances, resulting in considerably better networks at equal computational complexity. Our modifications improve the previous record FID of 2.41 in ImageNet-512 synthesis to 1.81, achieved using fast deterministic sampling. As an independent contribution, we present a method for setting the exponential moving average (EMA) parameters post-hoc, i.e., after completing the training run. This allows precise tuning of EMA length without the cost of performing several training runs, and reveals its surprising interactions with network architecture, training time, and guidance.
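For readers unfamiliar with weight averaging, the following is a minimal sketch of a conventional EMA over parameter snapshots, where the decay is fixed before training. It is not the post-hoc method described above. It only illustrates the baseline in which each EMA length needs its own averaging pass during training. The function names (`ema_update`, `run_ema`), the decay values, and the toy one-parameter "training run" are all illustrative assumptions.

```python
def ema_update(ema, params, beta):
    """One conventional EMA step: ema <- beta * ema + (1 - beta) * params."""
    return [beta * e + (1.0 - beta) * p for e, p in zip(ema, params)]


def run_ema(snapshots, beta):
    """Average a sequence of parameter snapshots with a fixed decay beta."""
    ema = list(snapshots[0])  # initialize from the first snapshot
    for params in snapshots[1:]:
        ema = ema_update(ema, params, beta)
    return ema


# Toy stand-in for a training run (assumption): one parameter that
# drifts linearly from 0.0 to 1.0 over ten steps.
snapshots = [[step / 10.0] for step in range(11)]

# With a conventional EMA, every candidate decay needs its own pass.
short_ema = run_ema(snapshots, beta=0.5)  # short averaging window
long_ema = run_ema(snapshots, beta=0.9)   # long averaging window

print("short EMA:", short_ema[0])
print("long EMA: ", long_ema[0])
```

A larger decay averages over a longer window, so it lags further behind the latest parameters. In this conventional setup, comparing several EMA lengths means maintaining several averaged copies or repeating training, which is the cost the post-hoc approach is said to avoid.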
