TLB-VFI: Temporal-Aware Latent Brownian Bridge Diffusion for Video Frame Interpolation
July 7, 2025
Authors: Zonglin Lyu, Chen Chen
cs.AI
Abstract
Video Frame Interpolation (VFI) aims to predict the intermediate frame I_n
(we use n to denote time in videos to avoid notation overload with the timestep
t in diffusion models) based on two consecutive neighboring frames I_0 and
I_1. Recent approaches apply diffusion models (both image-based and
video-based) to this task and achieve strong performance. However, image-based
diffusion models are unable to extract temporal information and are relatively
inefficient compared to non-diffusion methods. Video-based diffusion models can
extract temporal information, but they are prohibitively costly in training
scale, model size, and inference time. To mitigate these issues, we propose
Temporal-Aware Latent Brownian Bridge Diffusion for Video Frame Interpolation
(TLB-VFI), an efficient video-based diffusion model. By extracting rich
temporal information from video inputs through our proposed 3D-wavelet gating
and temporal-aware autoencoder, our method achieves a 20% improvement in FID on
the most challenging datasets over recent state-of-the-art image-based diffusion models.
Meanwhile, thanks to this rich temporal information, our method achieves
strong performance with 3x fewer parameters. This parameter reduction yields a
2.3x speedup. By incorporating optical flow
guidance, our method requires 9000x less training data and has over 20x fewer
parameters than video-based diffusion models. Code and results are
available at our project page: https://zonglinl.github.io/tlbvfi_page.
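
To make the bridge formulation concrete, below is a minimal sketch of the forward (noising) process of a latent Brownian bridge, assuming the standard BBDM-style parameterization; the function name, tensor shapes, and variance scale `s` are illustrative assumptions rather than the paper's exact implementation. Unlike standard diffusion, which diffuses toward pure Gaussian noise, the bridge is pinned to the target latent at t = 0 and to the conditioning latent at t = T.

```python
# Minimal sketch of the forward (noising) process of a latent Brownian bridge,
# following the standard BBDM-style parameterization. Names, shapes, and the
# variance scale `s` are illustrative assumptions, not the paper's exact code.
import torch

def brownian_bridge_forward(x0, y, t, T=1000, s=1.0):
    """Sample x_t on a bridge pinned at x0 (t = 0) and y (t = T).

    x0: latent of the ground-truth middle frame, shape [B, C, H, W]
    y:  latent of the conditioning endpoint,     shape [B, C, H, W]
    t:  integer timesteps in [0, T],             shape [B]
    """
    m_t = (t.float() / T).view(-1, 1, 1, 1)   # interpolation weight in [0, 1]
    delta_t = 2.0 * s * (m_t - m_t ** 2)      # bridge variance: zero at both ends
    eps = torch.randn_like(x0)
    x_t = (1.0 - m_t) * x0 + m_t * y + delta_t.sqrt() * eps
    return x_t, eps
```

Because the variance vanishes at both endpoints, the chain starts exactly at the conditioning latent and ends exactly at the target, which is what makes a bridge process a natural fit for interpolation between known frames: sampling begins from an informative endpoint rather than from random noise.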
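The abstract does not spell out the 3D-wavelet gating, but the general idea of wavelet-based temporal gating can be sketched: a single-level Haar split along the time axis separates a low-frequency (average) band from a high-frequency (motion) band, and the motion band drives a learned gate. The module below is a hypothetical illustration of that idea, not the paper's architecture.

```python
# Hypothetical temporal Haar-wavelet gate: the high-frequency (motion) band of
# a single-level Haar split along time modulates the input features. This is an
# illustrative guess at the general idea, not the paper's exact 3D-wavelet gating.
import torch
import torch.nn as nn

class TemporalHaarGate(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.to_gate = nn.Conv3d(channels, channels, kernel_size=1)

    def forward(self, x):
        # x: video features [B, C, T, H, W], with T even
        lo = (x[:, :, 0::2] + x[:, :, 1::2]) / 2 ** 0.5  # temporal average band (kept for reference)
        hi = (x[:, :, 0::2] - x[:, :, 1::2]) / 2 ** 0.5  # temporal detail (motion) band
        gate = torch.sigmoid(self.to_gate(hi))           # motion-aware gate in (0, 1)
        gate = gate.repeat_interleave(2, dim=2)          # restore temporal length T
        return x * gate                                  # emphasize moving regions
```

Applying `TemporalHaarGate(64)` to a `[1, 64, 4, 32, 32]` tensor preserves the input shape while reweighting features by temporal frequency content, which is one plausible way a model could inject the temporal awareness the abstract describes.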
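Optical-flow guidance in VFI typically means warping the endpoint frames toward the intermediate time with estimated flow and letting the warped frames guide synthesis. The helper below is a standard backward-warping routine of that kind, included purely for illustration; the paper's actual guidance mechanism may differ.

```python
# Illustrative backward-warping helper of the kind commonly used for
# optical-flow guidance in VFI; the interface is an assumption.
import torch
import torch.nn.functional as F

def backward_warp(img, flow):
    """Warp `img` [B, C, H, W] by a per-pixel `flow` [B, 2, H, W] of (dx, dy)."""
    b, _, h, w = img.shape
    ys, xs = torch.meshgrid(
        torch.arange(h, device=img.device),
        torch.arange(w, device=img.device),
        indexing="ij",
    )
    grid_x = xs[None] + flow[:, 0]  # source x-coordinate for each output pixel
    grid_y = ys[None] + flow[:, 1]  # source y-coordinate for each output pixel
    # normalize coordinates to [-1, 1] as required by grid_sample
    grid = torch.stack(
        (2 * grid_x / (w - 1) - 1, 2 * grid_y / (h - 1) - 1), dim=-1
    )
    return F.grid_sample(img, grid, align_corners=True)
```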