ZeroSmooth: Training-free Diffuser Adaptation for High Frame Rate Video Generation

Abstract

Video generation has made remarkable progress in recent years, especially since the advent of video diffusion models. Many video generation models, such as Stable Video Diffusion (SVD), can produce plausible synthetic videos. However, most video models can only generate low-frame-rate videos because of limited GPU memory and the difficulty of modeling a large number of frames. Training videos are always uniformly sampled at a specified interval for temporal compression. Previous methods increase the frame rate either by training a video interpolation model in pixel space as a post-processing stage or by training an interpolation model in latent space for a specific base video model. In this paper, we propose a training-free video interpolation method for generative video diffusion models that generalizes to different models in a plug-and-play manner. We investigate the non-linearity in the feature space of video diffusion models and transform a video model into a self-cascaded video diffusion model by incorporating the designed hidden state correction modules. The self-cascaded architecture and the correction modules are proposed to retain temporal consistency between key frames and interpolated frames. Extensive evaluations are performed on multiple popular video models to demonstrate the effectiveness of the proposed method. Notably, our training-free method is comparable to trained interpolation models supported by huge compute resources and large-scale datasets.
