SimpleGVR: A Simple Baseline for Latent-Cascaded Video Super-Resolution

Abstract

Latent diffusion models have emerged as a leading paradigm for efficient video generation. However, as user expectations shift toward higher-resolution outputs, relying solely on latent computation becomes inadequate. A promising approach decouples the process into two stages: semantic content generation and detail synthesis. The former employs a computationally intensive base model at lower resolutions, while the latter leverages a lightweight cascaded video super-resolution (VSR) model to achieve high-resolution output. In this work, we study key design principles for the latter stage, cascaded VSR models, which remain underexplored. First, we propose two degradation strategies to generate training pairs that better mimic the output characteristics of the base model, ensuring alignment between the VSR model and its upstream generator. Second, we provide critical insights into VSR model behavior through systematic analysis of (1) timestep sampling strategies and (2) the effects of noise augmentation on low-resolution (LR) inputs. These findings directly inform our architectural and training innovations. Finally, we introduce an interleaving temporal unit and sparse local attention to achieve efficient training and inference, drastically reducing computational overhead. Extensive experiments demonstrate the superiority of our framework over existing methods, and ablation studies confirm the efficacy of each design choice. Our work establishes a simple yet effective baseline for cascaded video super-resolution generation, offering practical insights to guide future advancements in efficient cascaded synthesis systems.
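
To make the idea of noise augmentation on LR inputs concrete, the following is a minimal sketch rather than the method used in this work. It assumes the LR latent is a flat list of floats and that augmentation is a simple linear blend with Gaussian noise, (1 - s) * x + s * eps. The function name `augment_lr` and the parameter `noise_level` are hypothetical and chosen only for illustration.

```python
import random


def augment_lr(lr_latent, noise_level, rng=None):
    """Blend a flattened LR latent with standard Gaussian noise.

    Hypothetical helper: the linear blend (1 - s) * x + s * eps is an
    assumed form for illustration, not the formulation from this work.
    """
    if not 0.0 <= noise_level <= 1.0:
        raise ValueError("noise_level must lie in [0, 1]")
    rng = rng or random.Random()
    return [
        (1.0 - noise_level) * x + noise_level * rng.gauss(0.0, 1.0)
        for x in lr_latent
    ]


latent = [0.5, -1.0, 2.0]
clean = augment_lr(latent, 0.0)                    # no noise: latent unchanged
noisy = augment_lr(latent, 0.3, random.Random(0))  # seeded, mildly perturbed
```

In this sketch, a `noise_level` of 0 leaves the conditioning input untouched, while larger values push the VSR model to rely less on the exact LR content. One plausible reading of that trade-off is that it eases the mismatch between synthetic training inputs and real base-model outputs.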