FlashVSR: Towards Real-Time Diffusion-Based Streaming Video Super-Resolution

Abstract

Diffusion models have recently advanced video restoration, but applying them to real-world video super-resolution (VSR) remains challenging due to high latency, prohibitive computation, and poor generalization to ultra-high resolutions. Our goal in this work is to make diffusion-based VSR practical by achieving efficiency, scalability, and real-time performance. To this end, we propose FlashVSR, the first diffusion-based one-step streaming framework towards real-time VSR. FlashVSR runs at approximately 17 FPS for 768x1408 videos on a single A100 GPU by combining three complementary innovations: (i) a train-friendly three-stage distillation pipeline that enables streaming super-resolution, (ii) locality-constrained sparse attention that cuts redundant computation while bridging the train-test resolution gap, and (iii) a tiny conditional decoder that accelerates reconstruction without sacrificing quality. To support large-scale training, we also construct VSR-120K, a new dataset with 120k videos and 180k images. Extensive experiments show that FlashVSR scales reliably to ultra-high resolutions and achieves state-of-the-art performance with up to 12x speedup over prior one-step diffusion VSR models. We will release the code, pretrained models, and dataset to foster future research in efficient diffusion-based VSR.
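
The abstract does not spell out how the locality-constrained sparse attention works, so the following is only a rough sketch of the general idea under stated assumptions. It assumes tokens laid out on a 2D grid and a square attention window of fixed radius; the function names `local_attention_mask` and `kept_fraction`, the window shape, and the grid sizes are all hypothetical and are not taken from FlashVSR. The point it illustrates is that a fixed local window gives every query the same maximum number of keys at any resolution, while the share of query-key pairs that must be computed falls as the grid grows.

```python
def local_attention_mask(height, width, radius):
    """Boolean mask over (query, key) token pairs on a height x width grid.

    A key is visible to a query only when it lies within `radius` grid
    steps along both axes (a square window). This is an illustrative
    locality rule, NOT FlashVSR's actual attention pattern.
    """
    coords = [(r, c) for r in range(height) for c in range(width)]
    return [
        [abs(qr - kr) <= radius and abs(qc - kc) <= radius for (kr, kc) in coords]
        for (qr, qc) in coords
    ]


def kept_fraction(mask):
    """Fraction of query-key pairs that survive the mask."""
    total = sum(len(row) for row in mask)
    kept = sum(sum(row) for row in mask)
    return kept / total


# Two hypothetical token grids: a small "training" size and a larger "test" size.
small = local_attention_mask(8, 8, radius=2)
large = local_attention_mask(16, 16, radius=2)

# An interior query sees (2 * radius + 1) ** 2 keys regardless of grid size.
max_keys_small = max(sum(row) for row in small)
max_keys_large = max(sum(row) for row in large)
print(max_keys_small, max_keys_large)

# The kept fraction shrinks as the grid grows, which is where the savings
# over dense attention come from.
print(kept_fraction(small), kept_fraction(large))
```

In a real model the mask would be applied inside the attention kernel instead of being materialized as a dense list of lists; the dense form is used here only to keep the counting explicit.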
