ViBiDSampler: Enhancing Video Interpolation Using Bidirectional Diffusion Sampler

Abstract

Recent progress in large-scale text-to-video (T2V) and image-to-video (I2V) diffusion models has greatly enhanced video generation, especially keyframe interpolation. However, current image-to-video diffusion models, while powerful at generating videos from a single conditioning frame, need adaptation for two-frame (start and end) conditioned generation, which is essential for effective bounded interpolation. Unfortunately, existing approaches that fuse temporally forward and backward paths in parallel often suffer from off-manifold issues, leading to artifacts or requiring multiple iterative re-noising steps. In this work, we introduce a novel bidirectional sampling strategy that addresses these off-manifold issues without requiring extensive re-noising or fine-tuning. Our method employs sequential sampling along both forward and backward paths, conditioned on the start and end frames, respectively, ensuring more coherent and on-manifold generation of intermediate frames. Additionally, we incorporate advanced guidance techniques, CFG++ and DDS, to further enhance the interpolation process. By integrating these techniques, our method achieves state-of-the-art performance, efficiently generating high-quality, smooth videos between keyframes. On a single 3090 GPU, our method can interpolate 25 frames at 1024 × 576 resolution in just 195 seconds, establishing it as a leading solution for keyframe interpolation.
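To make the idea of sequential sampling along forward and backward paths more concrete, the following toy sketch may help. It is not the authors' implementation: `toy_denoise` is a hypothetical stand-in for one denoising step of an I2V model, frames are plain vectors rather than video latents, and the time-flip trick, step count, and weights are all assumptions chosen for illustration. The only point it tries to show is the ordering: each step first updates the frames conditioned on the start frame, then updates the time-reversed frames conditioned on the end frame, instead of fusing two parallel paths.

```python
import numpy as np


def toy_denoise(x, cond_frame, strength=0.5):
    """Hypothetical stand-in for one I2V denoising step (an assumption).

    Pulls each frame toward `cond_frame`; the pull is strongest at frame
    index 0 and fades linearly to zero at the last frame.
    """
    n = x.shape[0]
    w = strength * np.linspace(1.0, 0.0, n).reshape(-1, 1)
    return (1.0 - w) * x + w * cond_frame


def bidirectional_sample(start, end, num_frames=5, num_steps=10, seed=0):
    """Toy sequential bidirectional sampling between two keyframes."""
    rng = np.random.default_rng(seed)
    # Begin from pure noise, one row per frame.
    x = rng.standard_normal((num_frames, start.shape[0]))
    for _ in range(num_steps):
        # Forward path: condition on the start frame.
        x = toy_denoise(x, start)
        # Backward path: reverse the time axis so the end frame acts as the
        # "first" frame, condition on it, then restore the original order.
        x = toy_denoise(x[::-1], end)[::-1]
    return x


if __name__ == "__main__":
    start = np.zeros(3)
    end = np.ones(3)
    frames = bidirectional_sample(start, end)
    print(frames.round(3))
```

In this toy setting the first frame converges to the start keyframe and the last frame to the end keyframe, while the intermediate frames settle between the two. A real sampler would replace `toy_denoise` with a diffusion model step and its noise schedule, which this sketch deliberately leaves out.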
