SwiftVR: Real-Time One-Step Generative Video Restoration

Abstract

Real-time video restoration (VR) for live streams requires high-resolution outputs under strict per-frame latency constraints. Existing one-step diffusion-based VR models remain difficult to deploy on consumer-grade GPUs because of two main bottlenecks: quadratic spatial attention at high resolutions and the latency and memory overhead of large video autoencoders. We present SwiftVR, a streaming one-step generative VR framework that reduces both bottlenecks under a causal chunk-wise protocol. For attention, SwiftVR uses mask-free shifted-window self-attention that gathers each spatial window into a dense tensor via deterministic indexing, so every attention call stays on the dense scaled dot-product attention (SDPA) path without masks, cyclic shifts, padding, or hardware-specific sparse kernels. Because SwiftVR uses only standard dense SDPA calls, the trained model transfers to consumer GPUs without retraining or custom kernels. For autoencoding, a lightweight Restoration-aware Autoencoder enables fast chunk-wise decoding while preserving reconstruction quality. On a single H100, SwiftVR sustains 31 FPS at 2560×1440 and 14 FPS at 3840×2160, whereas all compared diffusion-based VR baselines exceed the memory limit at 4K. On a consumer RTX 5090, SwiftVR reaches 26 FPS at 1920×1080. To our knowledge, SwiftVR is the first generative VR model to achieve real-time 1080p streaming on a consumer-grade GPU, while attaining strong no-reference perceptual quality at lower inference cost. The project page is available at https://h-oliday.github.io/SwiftVR.
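
To make the windowed-attention idea concrete, the following is a minimal NumPy sketch of gathering spatial windows into one dense tensor by index and running unmasked scaled dot-product attention on it. It is an illustration under stated assumptions, not the SwiftVR implementation: the function names (`window_index`, `sdpa`, `window_attention`) are hypothetical, query/key/value projections are omitted, and the border rule (clamping shifted windows inward so every window is full-size, then averaging tokens covered by more than one window) is an assumed stand-in for whatever rule the model actually uses.

```python
import numpy as np

def window_index(height, width, window, shift=0):
    """Flat token indices of every window, shape (num_windows, window * window)."""
    def starts(size):
        # Assumed border rule (not from the paper): clamp shifted windows inward
        # so each one is full-size, instead of padding, wrapping, or masking.
        first = shift - window if shift else 0
        raw = np.arange(first, size, window)
        return np.unique(np.clip(raw, 0, size - window))

    offsets = np.arange(window)
    rows = starts(height)[:, None] + offsets   # (windows_y, window)
    cols = starts(width)[:, None] + offsets    # (windows_x, window)
    # Flat index = row * width + col, broadcast over all window/offset pairs.
    idx = rows[:, None, :, None] * width + cols[None, :, None, :]
    return idx.reshape(-1, window * window)

def sdpa(q, k, v):
    """Dense scaled dot-product attention over the last two axes, no mask."""
    scores = q @ np.swapaxes(k, -1, -2) / np.sqrt(q.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

def window_attention(x, height, width, window, shift=0):
    """x: float array (height * width, channels). Returns the same shape."""
    idx = window_index(height, width, window, shift)
    tokens = x[idx]                       # gather: (num_windows, window**2, channels)
    out = sdpa(tokens, tokens, tokens)    # identity q/k/v projections for brevity
    # Scatter back; tokens covered by overlapping windows are averaged.
    summed = np.zeros_like(x)
    counts = np.zeros((x.shape[0], 1), dtype=x.dtype)
    np.add.at(summed, idx, out)
    np.add.at(counts, idx, 1.0)
    return summed / counts

# Usage: a regular layer followed by a shifted layer on an 8x8 token grid.
rng = np.random.default_rng(0)
frame = rng.standard_normal((8 * 8, 16))
hidden = window_attention(frame, 8, 8, window=4, shift=0)
hidden = window_attention(hidden, 8, 8, window=4, shift=2)
```

Because every window holds the same number of tokens, the gathered tensor is rectangular and a single batched dense attention call handles all windows. For N tokens and window side w, each token attends to w² others, so the number of query-key pairs grows as N·w² rather than the N² of global attention.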