FlashVSR: Towards Real-Time Diffusion-Based Streaming Video Super-Resolution
October 14, 2025
Authors: Junhao Zhuang, Shi Guo, Xin Cai, Xiaohui Li, Yihao Liu, Chun Yuan, Tianfan Xue
cs.AI
Abstract
Diffusion models have recently advanced video restoration, but applying them
to real-world video super-resolution (VSR) remains challenging due to high
latency, prohibitive computation, and poor generalization to ultra-high
resolutions. Our goal in this work is to make diffusion-based VSR practical by
achieving efficiency, scalability, and real-time performance. To this end, we
propose FlashVSR, the first diffusion-based one-step streaming framework
towards real-time VSR. FlashVSR runs at approximately 17 FPS for 768x1408
videos on a single A100 GPU by combining three complementary innovations: (i) a
train-friendly three-stage distillation pipeline that enables streaming
super-resolution, (ii) locality-constrained sparse attention that cuts
redundant computation while bridging the train-test resolution gap, and (iii) a
tiny conditional decoder that accelerates reconstruction without sacrificing
quality. To support large-scale training, we also construct VSR-120K, a new
dataset with 120k videos and 180k images. Extensive experiments show that
FlashVSR scales reliably to ultra-high resolutions and achieves
state-of-the-art performance with up to 12x speedup over prior one-step
diffusion VSR models. We will release the code, pretrained models, and dataset
to foster future research in efficient diffusion-based VSR.
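The locality constraint on attention can be illustrated with a minimal 1-D sketch. This is an assumption-laden toy, not the paper's actual mechanism (which operates on video tokens with sparse kernels): each query attends only to keys within a fixed window, so for a fixed window size the cost of the masked region grows linearly with sequence length rather than quadratically.

```python
import numpy as np

def local_attention(q, k, v, window):
    """Toy 1-D locality-constrained attention (illustrative only).

    Each query position i attends only to key positions j with
    |i - j| <= window; all other scores are masked to -inf before
    the softmax.
    """
    n, d = q.shape
    scores = q @ k.T / np.sqrt(d)                      # (n, n) similarity
    idx = np.arange(n)
    mask = np.abs(idx[:, None] - idx[None, :]) <= window
    scores = np.where(mask, scores, -np.inf)           # drop out-of-window keys
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)     # row-wise softmax
    return weights @ v

rng = np.random.default_rng(0)
q = rng.standard_normal((8, 4))
out = local_attention(q, q, q, window=2)
print(out.shape)  # (8, 4)
```

Because the mask depends only on positions, changing a value vector outside a query's window leaves that query's output untouched, which is what makes the computation on out-of-window pairs redundant and safe to skip entirely in a sparse kernel.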