TLB-VFI: Temporal-Aware Latent Brownian Bridge Diffusion for Video Frame Interpolation
July 7, 2025
Authors: Zonglin Lyu, Chen Chen
cs.AI
Abstract
Video Frame Interpolation (VFI) aims to predict the intermediate frame I_n
(we use n to denote time in videos to avoid notation overload with the timestep
t in diffusion models) based on two consecutive neighboring frames I_0 and
I_1. Recent approaches apply diffusion models (both image-based and
video-based) to this task and achieve strong performance. However, image-based
diffusion models are unable to extract temporal information and are relatively
inefficient compared to non-diffusion methods. Video-based diffusion models can
extract temporal information, but they are prohibitively costly in training
scale, model size, and inference time. To mitigate these issues, we propose
Temporal-Aware Latent Brownian Bridge Diffusion for Video Frame Interpolation
(TLB-VFI), an efficient video-based diffusion model. By extracting rich
temporal information from video inputs through our proposed 3D-wavelet gating
and temporal-aware autoencoder, our method achieves a 20% improvement in FID on
the most challenging datasets over recent state-of-the-art image-based diffusion models.
Meanwhile, thanks to this rich temporal information, our method achieves
strong performance with 3x fewer parameters. This parameter reduction yields a
2.3x speedup. By incorporating optical flow
guidance, our method requires 9000x less training data and has over 20x fewer
parameters than video-based diffusion models. Code and results are
available at our project page: https://zonglinl.github.io/tlbvfi_page.
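
To make the bridge formulation concrete, below is a minimal sketch of the forward (noising) process of a latent Brownian bridge, assuming the standard BBDM-style parameterization; the function name, tensor shapes, and variance scale `s` are illustrative assumptions rather than the paper's exact implementation. Unlike standard diffusion, which diffuses toward pure Gaussian noise, the bridge is pinned to the target latent at t = 0 and to the conditioning latent at t = T.

```python
# Minimal sketch of the forward (noising) process of a latent Brownian bridge,
# following the standard BBDM-style parameterization. Names, shapes, and the
# variance scale `s` are illustrative assumptions, not the paper's exact code.
import torch

def brownian_bridge_forward(x0, y, t, T=1000, s=1.0):
    """Sample x_t on a bridge pinned at x0 (t = 0) and y (t = T).

    x0: latent of the ground-truth middle frame, shape [B, C, H, W]
    y:  latent of the conditioning endpoint,     shape [B, C, H, W]
    t:  integer timesteps in [0, T],             shape [B]
    """
    m_t = (t.float() / T).view(-1, 1, 1, 1)   # interpolation weight in [0, 1]
    delta_t = 2.0 * s * (m_t - m_t ** 2)      # bridge variance: zero at both ends
    eps = torch.randn_like(x0)
    x_t = (1.0 - m_t) * x0 + m_t * y + delta_t.sqrt() * eps
    return x_t, eps
```

Because the variance vanishes at both endpoints, the chain starts exactly at the conditioning latent and ends exactly at the target, which is what makes a bridge process a natural fit for interpolation between known frames: sampling begins from an informative endpoint rather than from random noise.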
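The abstract does not spell out the 3D-wavelet gating, but the general idea of wavelet-based temporal gating can be sketched: a single-level Haar split along the time axis separates a low-frequency (average) band from a high-frequency (motion) band, and the motion band drives a learned gate. The module below is a hypothetical illustration of that idea, not the paper's architecture.

```python
# Hypothetical temporal Haar-wavelet gate: the high-frequency (motion) band of
# a single-level Haar split along time modulates the input features. This is an
# illustrative guess at the general idea, not the paper's exact 3D-wavelet gating.
import torch
import torch.nn as nn

class TemporalHaarGate(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.to_gate = nn.Conv3d(channels, channels, kernel_size=1)

    def forward(self, x):
        # x: video features [B, C, T, H, W], with T even
        lo = (x[:, :, 0::2] + x[:, :, 1::2]) / 2 ** 0.5  # temporal average band (kept for reference)
        hi = (x[:, :, 0::2] - x[:, :, 1::2]) / 2 ** 0.5  # temporal detail (motion) band
        gate = torch.sigmoid(self.to_gate(hi))           # motion-aware gate in (0, 1)
        gate = gate.repeat_interleave(2, dim=2)          # restore temporal length T
        return x * gate                                  # emphasize moving regions
```

Applying `TemporalHaarGate(64)` to a `[1, 64, 4, 32, 32]` tensor preserves the input shape while reweighting features by temporal frequency content, which is one plausible way a model could inject the temporal awareness the abstract describes.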
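Optical-flow guidance in VFI typically means warping the endpoint frames toward the intermediate time with estimated flow and letting the warped frames guide synthesis. The helper below is a standard backward-warping routine of that kind, included purely for illustration; the paper's actual guidance mechanism may differ.

```python
# Illustrative backward-warping helper of the kind commonly used for
# optical-flow guidance in VFI; the interface is an assumption.
import torch
import torch.nn.functional as F

def backward_warp(img, flow):
    """Warp `img` [B, C, H, W] by a per-pixel `flow` [B, 2, H, W] of (dx, dy)."""
    b, _, h, w = img.shape
    ys, xs = torch.meshgrid(
        torch.arange(h, device=img.device),
        torch.arange(w, device=img.device),
        indexing="ij",
    )
    grid_x = xs[None] + flow[:, 0]  # source x-coordinate for each output pixel
    grid_y = ys[None] + flow[:, 1]  # source y-coordinate for each output pixel
    # normalize coordinates to [-1, 1] as required by grid_sample
    grid = torch.stack(
        (2 * grid_x / (w - 1) - 1, 2 * grid_y / (h - 1) - 1), dim=-1
    )
    return F.grid_sample(img, grid, align_corners=True)
```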