SparkVSR: Interactive Video Super-Resolution via Sparse Keyframe Propagation

Abstract

Video Super-Resolution (VSR) aims to restore high-quality video frames from low-resolution (LR) inputs, yet most existing VSR approaches behave like black boxes at inference time: users cannot reliably correct unexpected artifacts and can only accept whatever the model produces. In this paper, we propose SparkVSR, a novel interactive VSR framework that turns sparse keyframes into a simple yet expressive control signal. Specifically, users first super-resolve a small set of keyframes using any off-the-shelf image super-resolution (ISR) model; SparkVSR then propagates the keyframe priors to the entire video sequence while remaining grounded in the motion of the original LR video. To this end, we introduce a keyframe-conditioned latent-pixel two-stage training pipeline that fuses LR video latents with sparsely encoded high-resolution (HR) keyframe latents to learn robust cross-space propagation and refine perceptual details. At inference time, SparkVSR supports flexible keyframe selection (manual specification, codec I-frame extraction, or random sampling) and a reference-free guidance mechanism that continuously balances keyframe adherence against blind restoration, ensuring robust performance even when reference keyframes are absent or imperfect. Experiments on multiple VSR benchmarks demonstrate improved temporal consistency and strong restoration quality, surpassing baselines by up to 24.6%, 21.8%, and 5.6% on CLIP-IQA, DOVER, and MUSIQ, respectively, and enabling controllable, keyframe-driven video super-resolution. Moreover, we demonstrate that SparkVSR is a generic, interactive, keyframe-conditioned video processing framework: it can be applied out of the box to unseen tasks such as old-film restoration and video style transfer. Our project page is available at: https://sparkvsr.github.io/
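The two inference-time controls named above (keyframe selection and the balance between keyframe adherence and blind restoration) can be illustrated with a minimal Python sketch. It is not the SparkVSR implementation. The function names `select_keyframes` and `guided_prediction`, their parameters, and the linear blending rule are all assumptions made for illustration; the abstract does not specify how the guidance mechanism is computed.

```python
import random
from typing import List, Optional, Sequence


def select_keyframes(
    num_frames: int,
    mode: str,
    manual: Optional[Sequence[int]] = None,
    frame_types: Optional[Sequence[str]] = None,
    count: int = 1,
    seed: Optional[int] = None,
) -> List[int]:
    """Return sorted, de-duplicated keyframe indices in [0, num_frames).

    Hypothetical helper covering the three selection modes listed in the
    abstract: manual specification, codec I-frames, random sampling.
    """
    if mode == "manual":
        if manual is None:
            raise ValueError("manual mode needs explicit indices")
        # Silently drop indices that fall outside the clip.
        picked = [i for i in manual if 0 <= i < num_frames]
    elif mode == "iframe":
        if frame_types is None or len(frame_types) != num_frames:
            raise ValueError("iframe mode needs one picture type per frame")
        # Assumes the caller already extracted per-frame picture types.
        picked = [i for i, t in enumerate(frame_types) if t == "I"]
    elif mode == "random":
        rng = random.Random(seed)
        picked = rng.sample(range(num_frames), min(count, num_frames))
    else:
        raise ValueError(f"unknown mode: {mode}")
    return sorted(set(picked))


def guided_prediction(
    with_ref: Sequence[float],
    without_ref: Sequence[float],
    scale: float,
) -> List[float]:
    """Blend two model outputs element-wise (assumed linear rule).

    scale = 0 keeps only the blind (reference-free) prediction;
    scale = 1 keeps only the keyframe-conditioned prediction.
    """
    if len(with_ref) != len(without_ref):
        raise ValueError("predictions must have the same length")
    return [u + scale * (c - u) for c, u in zip(with_ref, without_ref)]


if __name__ == "__main__":
    print(select_keyframes(5, "iframe", frame_types=list("IPPIB")))
    print(guided_prediction([1.0, 2.0], [0.0, 0.0], 0.5))
```

In practice the per-frame picture types would come from a container or codec inspection tool, and the two predictions would be tensors produced by the model with and without keyframe conditioning; plain lists are used here only to keep the sketch dependency-free.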