Large Scale Diffusion Distillation via Score-Regularized Continuous-Time Consistency

Abstract

This work represents the first effort to scale up continuous-time consistency distillation to general application-level image and video diffusion models. Although the continuous-time consistency model (sCM) is theoretically principled and empirically powerful for accelerating academic-scale diffusion, its applicability to large-scale text-to-image and video tasks remains unclear due to infrastructure challenges in Jacobian-vector product (JVP) computation and the limitations of standard evaluation benchmarks. We first develop a parallelism-compatible FlashAttention-2 JVP kernel, enabling sCM training on models with over 10 billion parameters and on high-dimensional video tasks. Our investigation reveals fundamental quality limitations of sCM in fine-detail generation, which we attribute to error accumulation and the "mode-covering" nature of its forward-divergence objective. To remedy this, we propose the score-regularized continuous-time consistency model (rCM), which incorporates score distillation as a long-skip regularizer. This integration complements sCM with the "mode-seeking" reverse divergence, effectively improving visual quality while maintaining high generation diversity. Validated on large-scale models (Cosmos-Predict2, Wan2.1) with up to 14B parameters and on 5-second videos, rCM matches or surpasses the state-of-the-art distillation method DMD2 on quality metrics while offering notable advantages in diversity, all without GAN tuning or extensive hyperparameter searches. The distilled models generate high-fidelity samples in only 1 to 4 steps, accelerating diffusion sampling by 15× to 50×. These results position rCM as a practical and theoretically grounded framework for advancing large-scale diffusion distillation.
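
The contrast between "mode-covering" and "mode-seeking" objectives can be illustrated with the two directions of the KL divergence. This is only a sketch: the abstract does not state which divergence each objective actually minimizes, so KL is used here as a representative assumption, and the symbols $p$ (teacher distribution) and $q_\theta$ (distilled student distribution) are introduced for illustration.

```latex
\begin{align}
D_{\mathrm{KL}}(p \,\|\, q_\theta)
  &= \mathbb{E}_{x \sim p}\!\left[\log \frac{p(x)}{q_\theta(x)}\right]
  && \text{(forward, mode-covering)} \\
D_{\mathrm{KL}}(q_\theta \,\|\, p)
  &= \mathbb{E}_{x \sim q_\theta}\!\left[\log \frac{q_\theta(x)}{p(x)}\right]
  && \text{(reverse, mode-seeking)}
\end{align}
```

The forward direction is averaged over teacher samples, so it heavily penalizes a student that assigns low probability where the teacher has mass; the student tends to spread over all modes, which can cost sharpness. The reverse direction is averaged over student samples, so it penalizes student mass placed where the teacher has little; the student tends to concentrate on high-density regions, which can cost diversity. Combining the two, as the abstract describes for rCM, aims to balance these tendencies.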
