SwiftVR: Real-Time One-Step Generative Video Restoration
June 8, 2026
Authors: Jiaqi Yan, Xiangyu Chen, Xinlin Zhong, Haibin Huang, Chi Zhang, Jie Liu, Jiantao Zhou, Xuelong Li
cs.AI
Abstract
Real-time video restoration (VR) for live streams requires high-resolution outputs under strict per-frame latency constraints. Existing one-step diffusion-based VR models remain difficult to deploy on consumer-grade GPUs because of two main bottlenecks: quadratic spatial attention at high resolutions and the latency and memory overhead of large video autoencoders. We present SwiftVR, a streaming one-step generative VR framework that reduces both bottlenecks under a causal chunk-wise protocol. For attention, mask-free shifted-window self-attention gathers each spatial window into a dense tensor via deterministic indexing, keeping all attention calls on the dense scaled dot-product attention (SDPA) path without masks, cyclic shifts, padding, or hardware-specific sparse kernels. Because SwiftVR uses only standard dense SDPA calls, the trained model transfers to consumer GPUs without retraining or custom kernels. For autoencoding, a lightweight Restoration-aware Autoencoder enables fast chunk-wise decoding while preserving reconstruction quality. On a single H100, SwiftVR sustains 31 FPS at 2560x1440 and 14 FPS at 3840x2160, whereas all compared diffusion-based VR baselines exceed the memory limit at 4K. On a consumer RTX 5090, SwiftVR reaches 26 FPS at 1920x1080. To our knowledge, SwiftVR is the first generative VR model to achieve real-time 1080p streaming on a consumer-grade GPU, while attaining strong no-reference perceptual quality at lower inference cost. The project page is available at https://h-oliday.github.io/SwiftVR.
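The mask-free window gathering described above can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not say how partial windows at the frame border are handled, so the sketch assumes they are simply grouped by shape, which gives each group its own dense tensor. The function names (`segments`, `window_indices`, `dense_sdpa`, `window_self_attention`) are hypothetical, NumPy stands in for a framework SDPA call, and the query, key, and value projections are omitted (q = k = v = x).

```python
import numpy as np

def segments(size, window, shift):
    """Cut [0, size) at shift, shift + window, ...; edge segments may be shorter."""
    cuts = sorted({0, size, *range(shift, size, window)})
    return list(zip(cuts[:-1], cuts[1:]))

def window_indices(height, width, window, shift):
    """Group the flat token indices of every window by window shape.

    Returns {(h, w): int array of shape (num_windows, h * w)}.
    Assumption: border windows are kept as smaller windows, not padded.
    """
    groups = {}
    for r0, r1 in segments(height, window, shift):
        for c0, c1 in segments(width, window, shift):
            rows = np.arange(r0, r1)[:, None]
            cols = np.arange(c0, c1)[None, :]
            flat = (rows * width + cols).ravel()
            groups.setdefault((r1 - r0, c1 - c0), []).append(flat)
    return {shape: np.stack(idx) for shape, idx in groups.items()}

def dense_sdpa(q, k, v):
    """Plain scaled dot-product attention over the last two axes, no mask."""
    scores = q @ np.swapaxes(k, -1, -2) / np.sqrt(q.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

def window_self_attention(x, height, width, window, shift=0):
    """x: (height * width, dim) tokens in row-major order."""
    out = np.empty_like(x)
    for idx in window_indices(height, width, window, shift).values():
        tokens = x[idx]  # gather: dense (num_windows, tokens, dim)
        out[idx] = dense_sdpa(tokens, tokens, tokens)  # scatter back
    return out

# Example: 8x8 token grid, 4x4 windows, shifted by 2 tokens.
rng = np.random.default_rng(0)
x = rng.standard_normal((8 * 8, 4))
y = window_self_attention(x, height=8, width=8, window=4, shift=2)
```

Because the index arrays depend only on the grid size, window size, and shift, they could be computed once and reused for every frame. Each attention call then sees an ordinary dense batch, which is the property the abstract credits for portability across GPUs.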