FlashVSR: Towards Real-Time Diffusion-Based Streaming Video Super-Resolution
October 14, 2025
Authors: Junhao Zhuang, Shi Guo, Xin Cai, Xiaohui Li, Yihao Liu, Chun Yuan, Tianfan Xue
cs.AI
Abstract
Diffusion models have recently advanced video restoration, but applying them
to real-world video super-resolution (VSR) remains challenging due to high
latency, prohibitive computation, and poor generalization to ultra-high
resolutions. Our goal in this work is to make diffusion-based VSR practical by
achieving efficiency, scalability, and real-time performance. To this end, we
propose FlashVSR, the first diffusion-based one-step streaming framework
towards real-time VSR. FlashVSR runs at approximately 17 FPS for 768x1408
videos on a single A100 GPU by combining three complementary innovations: (i) a
train-friendly three-stage distillation pipeline that enables streaming
super-resolution, (ii) locality-constrained sparse attention that cuts
redundant computation while bridging the train-test resolution gap, and (iii) a
tiny conditional decoder that accelerates reconstruction without sacrificing
quality. To support large-scale training, we also construct VSR-120K, a new
dataset with 120k videos and 180k images. Extensive experiments show that
FlashVSR scales reliably to ultra-high resolutions and achieves
state-of-the-art performance with up to 12x speedup over prior one-step
diffusion VSR models. We will release the code, pretrained models, and dataset
to foster future research in efficient diffusion-based VSR.
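The abstract names locality-constrained sparse attention as one of the three innovations but does not specify its exact formulation. As a rough illustration of the general idea — restricting each query to a local window of keys so that attention cost no longer grows quadratically with resolution — here is a minimal NumPy sketch. The function names and the simple banded-window constraint are assumptions for illustration, not FlashVSR's actual design.

```python
import numpy as np

def local_attention_mask(seq_len: int, window: int) -> np.ndarray:
    """Boolean mask letting each query attend only to keys within
    `window` positions. A generic locality constraint, assumed here
    for illustration; FlashVSR's exact pattern is not given in the
    abstract."""
    idx = np.arange(seq_len)
    return np.abs(idx[:, None] - idx[None, :]) <= window

def masked_attention(q: np.ndarray, k: np.ndarray, v: np.ndarray,
                     window: int) -> np.ndarray:
    """Scaled dot-product attention restricted to a local window.

    q, k, v: arrays of shape (seq_len, dim).
    """
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)
    mask = local_attention_mask(q.shape[0], window)
    # Disallowed positions get -inf, so they receive zero weight.
    scores = np.where(mask, scores, -np.inf)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v
```

Because each row of the mask has at most `2*window + 1` active entries, a kernel that exploits this sparsity does O(n·w) work instead of O(n²), which is also how a fixed training-time window can transfer to much longer test-time sequences (bridging the train-test resolution gap the abstract mentions).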