SimpleGVR: A Simple Baseline for Latent-Cascaded Video Super-Resolution

Abstract

Latent diffusion models have emerged as a leading paradigm for efficient video generation. However, as user expectations shift toward higher-resolution outputs, relying solely on latent computation becomes inadequate. A promising approach decouples the process into two stages: semantic content generation and detail synthesis. The former employs a computationally intensive base model at lower resolutions, while the latter leverages a lightweight cascaded video super-resolution (VSR) model to achieve high-resolution output. In this work, we study key design principles for the latter stage, the cascaded VSR model, which remains underexplored. First, we propose two degradation strategies to generate training pairs that better mimic the output characteristics of the base model, ensuring alignment between the VSR model and its upstream generator. Second, we provide critical insights into VSR model behavior through systematic analysis of (1) timestep sampling strategies and (2) noise augmentation effects on low-resolution (LR) inputs. These findings directly inform our architectural and training innovations. Finally, we introduce an interleaving temporal unit and sparse local attention to achieve efficient training and inference, drastically reducing computational overhead. Extensive experiments demonstrate the superiority of our framework over existing methods, with ablation studies confirming the efficacy of each design choice. Our work establishes a simple yet effective baseline for cascaded video super-resolution generation, offering practical insights to guide future advancements in efficient cascaded synthesis systems.
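
The abstract names sparse local attention as a source of efficiency but does not say how it is constructed. As a rough illustration only, the sketch below assumes a single latent frame flattened row by row, with each query token attending to keys inside a small square neighborhood. The function names, the `window` radius, and the 8 x 8 grid are hypothetical choices made for this example and are not taken from the paper.

```python
from itertools import product


def local_attention_mask(height, width, window):
    """Build a boolean mask for window-restricted attention.

    mask[q][k] is True when key token k lies within `window` grid cells
    (Chebyshev distance) of query token q. Tokens are the row-major
    flattening of a height x width latent grid. `window` is an assumed,
    illustrative radius, not a value reported by the authors.
    """
    coords = list(product(range(height), range(width)))
    return [
        [max(abs(qy - ky), abs(qx - kx)) <= window for (ky, kx) in coords]
        for (qy, qx) in coords
    ]


def attended_pairs(mask):
    """Count the query-key pairs that the mask keeps."""
    return sum(sum(row) for row in mask)


mask = local_attention_mask(8, 8, window=1)
num_tokens = 8 * 8
print("full attention pairs: ", num_tokens * num_tokens)
print("local attention pairs:", attended_pairs(mask))
```

With full attention the number of query-key pairs grows quadratically in the token count, whereas a fixed window keeps it roughly linear, which is the kind of saving the abstract alludes to. In practice such a mask would be passed to an attention kernel rather than materialized as nested lists.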