SwiftVR: Real-Time One-Step Generative Video Restoration

June 8, 2026
Authors: Jiaqi Yan, Xiangyu Chen, Xinlin Zhong, Haibin Huang, Chi Zhang, Jie Liu, Jiantao Zhou, Xuelong Li
cs.AI

Abstract

Real-time video restoration (VR) for live streams requires high-resolution outputs under strict per-frame latency constraints. Existing one-step diffusion-based VR models remain difficult to deploy on consumer-grade GPUs because of two main bottlenecks: quadratic spatial attention at high resolutions and the latency and memory overhead of large video autoencoders. We present SwiftVR, a streaming one-step generative VR framework that reduces both bottlenecks under a causal chunk-wise protocol. For attention, mask-free shifted-window self-attention gathers each spatial window into a dense tensor via deterministic indexing, keeping all attention calls on the dense scaled dot-product attention (SDPA) path without masks, cyclic shifts, padding, or hardware-specific sparse kernels. Because SwiftVR uses only standard dense SDPA calls, the trained model transfers to consumer GPUs without retraining or custom kernels. For autoencoding, a lightweight Restoration-aware Autoencoder enables fast chunk-wise decoding while preserving reconstruction quality. On a single H100, SwiftVR sustains 31 FPS at 2560×1440 and 14 FPS at 3840×2160, whereas all compared diffusion-based VR baselines exceed the memory limit at 4K. On a consumer RTX 5090, SwiftVR reaches 26 FPS at 1920×1080. To our knowledge, SwiftVR is the first generative VR model to achieve real-time 1080p streaming on a consumer-grade GPU, while attaining strong no-reference perceptual quality at lower inference cost. The project page is available at https://h-oliday.github.io/SwiftVR.
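
The abstract does not spell out how windows are gathered "via deterministic indexing", so the following is only a minimal sketch of one scheme that would satisfy the stated constraints (no masks, no cyclic shifts, no padding). It is an assumption for illustration, not the authors' implementation: the function names `shifted_window_indices` and `group_by_shape`, the 8×8 token grid, the window size 4, and the shift 2 are all hypothetical. The idea is to let windows that touch the border simply be smaller, then bucket windows by shape so that each bucket can be stacked into one dense batch.

```python
from collections import defaultdict


def shifted_window_indices(height, width, window, shift):
    """Return (shape, flat_indices) for every window of a shifted partition.

    Assumed scheme (not taken from the paper): window boundaries sit at
    shift, shift + window, ... along each axis. Windows that touch the
    border are smaller; nothing is padded, wrapped around, or masked.
    """
    if not 0 <= shift < window:
        raise ValueError("shift must satisfy 0 <= shift < window")

    def spans(size):
        # Cut points along one axis: 0, shift, shift + window, ..., size.
        cuts = sorted(set([0] + list(range(shift, size, window)) + [size]))
        return list(zip(cuts[:-1], cuts[1:]))

    windows = []
    for top, bottom in spans(height):
        for left, right in spans(width):
            # Row-major flat indices of the tokens inside this window.
            idx = [r * width + c
                   for r in range(top, bottom)
                   for c in range(left, right)]
            windows.append(((bottom - top, right - left), idx))
    return windows


def group_by_shape(windows):
    """Bucket windows by (rows, cols) so each bucket forms a dense batch."""
    groups = defaultdict(list)
    for shape, idx in windows:
        groups[shape].append(idx)
    return dict(groups)


# Hypothetical example: 8x8 token grid, 4x4 windows, shift of 2 tokens.
windows = shifted_window_indices(8, 8, window=4, shift=2)
groups = group_by_shape(windows)
for shape, members in sorted(groups.items()):
    print(shape, "x", len(members))
```

Because the index lists depend only on the grid size, window size, and shift, they can be computed once and reused for every frame. In a tensor framework, each bucket would be gathered with those indices into a tensor of shape (windows, tokens, channels) and passed to an ordinary dense attention call; with PyTorch, for instance, `torch.nn.functional.scaled_dot_product_attention` could be called with no `attn_mask` argument. Whether SwiftVR handles border windows this way is not stated in the abstract.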