TLB-VFI: Temporal-Aware Latent Brownian Bridge Diffusion for Video Frame Interpolation
July 7, 2025
Authors: Zonglin Lyu, Chen Chen
cs.AI
Abstract
Video Frame Interpolation (VFI) aims to predict the intermediate frame I_n
(we use n to denote time in videos to avoid notation overload with the timestep
t in diffusion models) based on two consecutive neighboring frames I_0 and
I_1. Recent approaches apply diffusion models (both image-based and
video-based) to this task and achieve strong performance. However, image-based
diffusion models are unable to extract temporal information and are relatively
inefficient compared to non-diffusion methods. Video-based diffusion models can
extract temporal information, but they are prohibitively costly in terms of training
scale, model size, and inference time. To mitigate these issues, we propose
Temporal-Aware Latent Brownian Bridge Diffusion for Video Frame Interpolation
(TLB-VFI), an efficient video-based diffusion model. By extracting rich
temporal information from video inputs through our proposed 3D-wavelet gating
and temporal-aware autoencoder, our method achieves a 20% improvement in FID on
the most challenging datasets over recent SOTA image-based diffusion models.
Meanwhile, thanks to this rich temporal information, our method achieves strong
performance with 3x fewer parameters. This parameter reduction yields a 2.3x
speedup. By incorporating optical flow guidance, our method requires 9000x less
training data and has over 20x fewer parameters than video-based diffusion models.
Code and results are available at our project page: https://zonglinl.github.io/tlbvfi_page.
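
For context, a Brownian bridge diffusion differs from standard diffusion in that the process terminates at a conditioning latent rather than at pure Gaussian noise. A generic form of the forward process is sketched below, following the common BBDM formulation; the exact schedule and conditioning used by TLB-VFI are not stated in this abstract and are assumptions here:

q(x_t \mid x_0, y) = \mathcal{N}\big(x_t;\ (1 - m_t)\,x_0 + m_t\,y,\ \delta_t \mathbf{I}\big), \qquad m_t = t/T, \qquad \delta_t = 2\,(m_t - m_t^2),

where x_0 denotes the latent of the target intermediate frame and y a conditioning latent (in a VFI setting, typically derived from encoding the neighboring frames I_0 and I_1). The variance \delta_t vanishes at both t = 0 and t = T, so the process starts exactly at x_0 and ends exactly at y; sampling only needs to traverse the bridge between the two latents instead of denoising from pure noise.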