ViBiDSampler: Enhancing Video Interpolation Using Bidirectional Diffusion Sampler

Abstract

Recent progress in large-scale text-to-video (T2V) and image-to-video (I2V) diffusion models has greatly enhanced video generation, particularly for keyframe interpolation. However, current image-to-video diffusion models, while powerful at generating videos from a single conditioning frame, need adaptation for two-frame (start and end) conditioned generation, which is essential for effective bounded interpolation. Unfortunately, existing approaches that fuse temporally forward and backward paths in parallel often suffer from off-manifold issues, leading to artifacts or requiring multiple iterative re-noising steps. In this work, we introduce a novel bidirectional sampling strategy that addresses these off-manifold issues without requiring extensive re-noising or fine-tuning. Our method samples sequentially along the forward and backward paths, conditioned on the start and end frames, respectively, ensuring more coherent, on-manifold generation of intermediate frames. Additionally, we incorporate the advanced guidance techniques CFG++ and DDS to further enhance the interpolation process. By integrating these, our method achieves state-of-the-art performance, efficiently generating high-quality, smooth videos between keyframes. On a single 3090 GPU, our method can interpolate 25 frames at 1024 × 576 resolution in just 195 seconds, establishing it as a leading solution for keyframe interpolation.
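The sequential forward-then-backward idea can be illustrated with a toy sketch. This is not the authors' implementation: each "frame" is a single number, `toy_denoise_step` is a hypothetical stand-in for one step of a model conditioned only on its first frame, and the per-step order used here (denoise on the start frame, re-noise, flip in time, denoise on the end frame, flip back) is an assumed reading of the abstract. CFG++ and DDS guidance are omitted, and all function names and constants are illustrative.

```python
import random


def toy_denoise_step(frames, cond):
    """Hypothetical stand-in for one denoising step of a first-frame-conditioned model."""
    out = [cond]  # the first frame is pinned to the conditioning frame
    for x in frames[1:]:
        # each later frame moves halfway toward its already-updated predecessor
        out.append(0.5 * x + 0.5 * out[-1])
    return out


def renoise(frames, sigma, rng):
    """Add Gaussian noise of scale sigma to every frame (toy re-noising step)."""
    return [x + rng.gauss(0.0, sigma) for x in frames]


def bidirectional_sample(start, end, num_frames=9, num_steps=30, seed=0):
    """Toy sequential bidirectional sampler between two scalar keyframes."""
    rng = random.Random(seed)
    frames = [rng.gauss(0.0, 1.0) for _ in range(num_frames)]  # start from pure noise
    for step in range(num_steps):
        # assumed noise schedule: decays linearly and reaches 0 on the last step
        sigma = 0.1 * (1.0 - (step + 1) / num_steps)
        # 1) forward path: denoise conditioned on the start frame
        frames = toy_denoise_step(frames, start)
        # 2) re-noise once so the second pass sees a comparable noise level
        frames = renoise(frames, sigma, rng)
        # 3) backward path: flip in time, denoise conditioned on the end frame, flip back
        frames = toy_denoise_step(frames[::-1], end)[::-1]
    return frames


frames = bidirectional_sample(start=0.0, end=1.0)
print([round(x, 3) for x in frames])
```

Because the two passes run one after the other on the same sequence, each pass refines the output of the other instead of averaging two independently generated paths, which is the contrast with parallel fusion that the abstract draws.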