TLB-VFI: Temporal-Aware Latent Brownian Bridge Diffusion for Video Frame Interpolation

July 7, 2025
Authors: Zonglin Lyu, Chen Chen
cs.AI

Abstract

Video Frame Interpolation (VFI) aims to predict the intermediate frame I_n (we use n to denote time in videos to avoid notation overload with the timestep t in diffusion models) based on two consecutive neighboring frames I_0 and I_1. Recent approaches apply diffusion models (both image-based and video-based) to this task and achieve strong performance. However, image-based diffusion models are unable to extract temporal information and are relatively inefficient compared to non-diffusion methods. Video-based diffusion models can extract temporal information, but they are too large in terms of training scale, model size, and inference time. To mitigate these issues, we propose Temporal-Aware Latent Brownian Bridge Diffusion for Video Frame Interpolation (TLB-VFI), an efficient video-based diffusion model. By extracting rich temporal information from video inputs through our proposed 3D-wavelet gating and temporal-aware autoencoder, our method achieves a 20% improvement in FID on the most challenging datasets over recent state-of-the-art image-based diffusion models. Meanwhile, thanks to this rich temporal information, our method achieves strong performance while having 3× fewer parameters; this parameter reduction yields a 2.3× speedup. By incorporating optical flow guidance, our method requires 9000× less training data and has over 20× fewer parameters than video-based diffusion models. Code and results are available at our project page: https://zonglinl.github.io/tlbvfi_page.
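For readers unfamiliar with the Brownian bridge diffusion that the method's name references, the following is a minimal sketch of the standard forward process from prior latent Brownian bridge models (BBDM-style). The endpoint roles here (x_0 as the latent of the target frame I_n, y as the conditioning latent derived from I_0 and I_1) and the schedule m_t, δ_t are assumptions based on that prior work, not details confirmed by this abstract:

$$
x_t = (1 - m_t)\,x_0 + m_t\,y + \sqrt{\delta_t}\,\epsilon, \qquad \epsilon \sim \mathcal{N}(0, I), \qquad m_t = \frac{t}{T}, \qquad \delta_t = 2s\,(m_t - m_t^2).
$$

Unlike standard diffusion, which terminates at pure Gaussian noise, the bridge pins both ends: at t = 0 the sample equals x_0 and at t = T it equals y, with the variance δ_t vanishing at both endpoints. Sampling therefore traverses between two structured latents rather than starting from unstructured noise, which is part of what makes bridge-based formulations attractive for frame interpolation.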