SwiftVR: Real-Time One-Step Generative Video Restoration

Abstract

Real-time video restoration (VR) for live streams requires high-resolution outputs under strict per-frame latency constraints. Existing one-step diffusion-based VR models remain difficult to deploy on consumer-grade GPUs because of two main bottlenecks: quadratic spatial attention at high resolutions, and the latency and memory overhead of large video autoencoders. We present SwiftVR, a streaming one-step generative VR framework that reduces both bottlenecks under a causal chunk-wise protocol. For attention, mask-free shifted-window self-attention gathers each spatial window into a dense tensor via deterministic indexing, keeping all attention calls on the dense scaled dot-product attention (SDPA) path without masks, cyclic shifts, padding, or hardware-specific sparse kernels. Because SwiftVR uses only standard dense SDPA calls, the trained model transfers to consumer GPUs without retraining or custom kernels. For autoencoding, a lightweight Restoration-aware Autoencoder enables fast chunk-wise decoding while preserving reconstruction quality. On a single H100, SwiftVR sustains 31 FPS at 2560×1440 and 14 FPS at 3840×2160, whereas all compared diffusion-based VR baselines exceed the memory limit at 4K. On a consumer RTX 5090, SwiftVR reaches 26 FPS at 1920×1080. To our knowledge, SwiftVR is the first generative VR model to achieve real-time 1080p streaming on a consumer-grade GPU, while attaining strong no-reference perceptual quality at lower inference cost. The project page is available at https://h-oliday.github.io/SwiftVR.
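
To make the quadratic-attention bottleneck concrete, the following back-of-the-envelope sketch counts query-key pairs per frame for global versus windowed self-attention at the three resolutions mentioned above. The 8x spatial downsampling factor, the 8x8 window size, and the helper names are illustrative assumptions; they are not taken from SwiftVR, and the windowed count ignores partial windows at the frame border.

```python
# Illustrative arithmetic only. The stride (8) and window size (8) below are
# hypothetical values chosen for the example, not SwiftVR's configuration.

def token_count(width, height, stride=8):
    """Spatial tokens per frame after downsampling each side by `stride`."""
    return (width // stride) * (height // stride)

def full_attention_pairs(n_tokens):
    """Query-key pairs scored by global self-attention (quadratic in tokens)."""
    return n_tokens * n_tokens

def windowed_attention_pairs(n_tokens, window=8):
    """Approximate pairs when each token attends only inside its window."""
    return n_tokens * window * window

for width, height in [(1920, 1080), (2560, 1440), (3840, 2160)]:
    n = token_count(width, height)
    full = full_attention_pairs(n)
    local = windowed_attention_pairs(n)
    print(f"{width}x{height}: {n} tokens, full/windowed = {full / local:.0f}x")
```

Under these assumptions, moving from 1080p to 4K multiplies the token count by 4 and the global-attention pair count by 16, while the windowed pair count grows only by 4, which is why window-local attention matters most at the highest resolutions.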