SwiftVR: Real-Time One-Step Generative Video Restoration

Abstract

Real-time video restoration (VR) for live streams requires high-resolution output under strict per-frame latency constraints. Existing one-step diffusion-based VR models remain difficult to deploy on consumer-grade GPUs because of two main bottlenecks: spatial attention whose cost is quadratic at high resolutions, and the latency and memory overhead of large video autoencoders. We present SwiftVR, a streaming one-step generative VR framework that reduces both bottlenecks under a causal chunk-wise protocol. For attention, mask-free shifted-window self-attention gathers each spatial window into a dense tensor via deterministic indexing, keeping all attention calls on the dense scaled dot-product attention (SDPA) path without masks, cyclic shifts, padding, or hardware-specific sparse kernels. Because SwiftVR uses only standard dense SDPA calls, the trained model transfers to consumer GPUs without retraining or custom kernels. For autoencoding, a lightweight Restoration-aware Autoencoder enables fast chunk-wise decoding while preserving reconstruction quality. On a single H100, SwiftVR sustains 31 FPS at 2560x1440 and 14 FPS at 3840x2160, whereas all compared diffusion-based VR baselines exceed the memory limit at 4K. On a consumer RTX 5090, SwiftVR reaches 26 FPS at 1920x1080. To our knowledge, SwiftVR is the first generative VR model to achieve real-time 1080p streaming on a consumer-grade GPU, while attaining strong no-reference perceptual quality at lower inference cost. The project page is available at https://h-oliday.github.io/SwiftVR.
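
To make the idea of gathering shifted windows into dense blocks concrete, the following is a minimal sketch and not the SwiftVR implementation. The abstract does not specify the indexing scheme, so the sketch makes several assumptions: shifted windows that would cross the border are clamped inward so that every window is full-sized, outputs at positions covered by more than one window are averaged, queries, keys, and values are the raw tokens with no learned projections, and a pure-Python softmax attention stands in for a GPU SDPA kernel. All function names (`window_starts`, `window_indices`, `dense_attention`, `window_attention`) are hypothetical.

```python
import math


def window_starts(size, window, shift):
    """Window origins along one axis (requires size >= window).

    Assumption of this sketch: origins are clamped into [0, size - window]
    instead of cyclically rolling the feature map, so every window holds
    exactly `window` positions and needs no mask or padding.
    """
    raw = range(shift - window, size, window)
    return sorted({min(max(s, 0), size - window) for s in raw})


def window_indices(height, width, window, shift):
    """Flat token indices of every window; each list has window * window entries."""
    rows = window_starts(height, window, shift)
    cols = window_starts(width, window, shift)
    return [
        [(r0 + dr) * width + (c0 + dc)
         for dr in range(window) for dc in range(window)]
        for r0 in rows for c0 in cols
    ]


def dense_attention(q, k, v):
    """Plain scaled dot-product attention over one dense block of tokens."""
    scale = 1.0 / math.sqrt(len(q[0]))
    out = []
    for qi in q:
        scores = [scale * sum(a * b for a, b in zip(qi, kj)) for kj in k]
        top = max(scores)  # subtract the max for numerical stability
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        out.append([
            sum(w * vj[d] for w, vj in zip(weights, v)) / total
            for d in range(len(v[0]))
        ])
    return out


def window_attention(tokens, height, width, window, shift):
    """Run dense attention inside each gathered window and scatter results back.

    Positions covered by several clamped windows are averaged
    (an assumption of this sketch, not a detail from the abstract).
    """
    dim = len(tokens[0])
    acc = [[0.0] * dim for _ in tokens]
    hits = [0] * len(tokens)
    for idx in window_indices(height, width, window, shift):
        block = [tokens[i] for i in idx]  # fixed-shape dense gather
        result = dense_attention(block, block, block)
        for i, vec in zip(idx, result):
            for d in range(dim):
                acc[i][d] += vec[d]
            hits[i] += 1
    return [[x / n for x in row] for row, n in zip(acc, hits)]


# Tiny demo: an 8x8 grid of 2-dimensional tokens, 4x4 windows, shift of 2.
demo_tokens = [[float(i % 8), float(i // 8)] for i in range(64)]
demo_output = window_attention(demo_tokens, 8, 8, window=4, shift=2)
```

Because every index list has the same length, the gathered blocks could in principle be stacked into one tensor of shape (number of windows, window * window, channels) and passed to a single batched dense attention call, which is the property the abstract emphasizes. The clamping and averaging rules here are only one way to obtain equal-sized windows.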