DCM: Dual-Expert Consistency Model for Efficient and High-Quality Video Generation

Abstract

Diffusion models have achieved remarkable results in video synthesis but require iterative denoising steps, leading to substantial computational overhead. Consistency models have made significant progress in accelerating diffusion models. However, directly applying them to video diffusion models often results in severe degradation of temporal consistency and appearance details. In this paper, by analyzing the training dynamics of consistency models, we identify a key conflict in the learning dynamics during the distillation process: the optimization gradients and loss contributions differ significantly across timesteps. This discrepancy prevents the distilled student model from reaching an optimal state, leading to compromised temporal consistency and degraded appearance details. To address this issue, we propose a parameter-efficient Dual-Expert Consistency Model (DCM), where a semantic expert focuses on learning semantic layout and motion, while a detail expert specializes in fine detail refinement. Furthermore, we introduce a Temporal Coherence Loss to improve motion consistency for the semantic expert, and apply GAN and Feature Matching Losses to enhance the synthesis quality of the detail expert. Our approach achieves state-of-the-art visual quality with significantly reduced sampling steps, demonstrating the effectiveness of expert specialization in video diffusion model distillation. Our code and models are available at https://github.com/Vchitect/DCM.
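The division of labor between the two experts can be illustrated with a minimal sketch. The abstract does not specify how inputs are dispatched, so this sketch assumes that the expert is selected by timestep (high-noise steps go to the semantic expert, low-noise steps to the detail expert). The names `route`, `sample`, `EXPERTS`, the `boundary` value, and the toy expert functions are all hypothetical and are not taken from the DCM code release.

```python
from typing import Callable, Dict, List, Sequence, Tuple

# Stand-in for a video latent; a real model would operate on tensors.
Latent = List[float]
Expert = Callable[[Latent, float], Latent]


def route(t: float, boundary: float) -> str:
    """Pick an expert by timestep (assumed rule, not confirmed by the abstract)."""
    return "semantic" if t >= boundary else "detail"


def sample(
    experts: Dict[str, Expert],
    x: Latent,
    timesteps: Sequence[float],
    boundary: float,
) -> Tuple[Latent, List[str]]:
    """Run a few-step loop, dispatching each step to one expert.

    Returns the final latent and a trace of which expert handled each step.
    """
    trace: List[str] = []
    for t in timesteps:
        name = route(t, boundary)
        x = experts[name](x, t)
        trace.append(name)
    return x, trace


# Toy experts for illustration only: they do not denoise anything.
EXPERTS: Dict[str, Expert] = {
    "semantic": lambda x, t: [v * 0.5 for v in x],  # coarse update
    "detail": lambda x, t: [v + 1.0 for v in x],    # fine update
}

if __name__ == "__main__":
    final, trace = sample(EXPERTS, [8.0], [1.0, 0.75, 0.5, 0.25], boundary=0.5)
    print(trace, final)
```

With a four-step schedule and a boundary of 0.5, the first three steps in this toy setup are handled by the semantic expert and the last by the detail expert. In an actual distillation setting each expert would also be trained with its own losses, as the abstract describes.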
