SimpleGVR: A Simple Baseline for Latent-Cascaded Video Super-Resolution
June 24, 2025
Authors: Liangbin Xie, Yu Li, Shian Du, Menghan Xia, Xintao Wang, Fanghua Yu, Ziyan Chen, Pengfei Wan, Jiantao Zhou, Chao Dong
cs.AI
Abstract
Latent diffusion models have emerged as a leading paradigm for efficient
video generation. However, as user expectations shift toward higher-resolution
outputs, relying solely on latent computation becomes inadequate. A promising
approach involves decoupling the process into two stages: semantic content
generation and detail synthesis. The former employs a computationally intensive
base model at lower resolutions, while the latter leverages a lightweight
cascaded video super-resolution (VSR) model to achieve high-resolution output.
In this work, we focus on studying key design principles for the latter cascaded
VSR models, which are currently underexplored. First, we propose two
degradation strategies to generate training pairs that better mimic the output
characteristics of the base model, ensuring alignment between the VSR model and
its upstream generator. Second, we provide critical insights into VSR model
behavior through systematic analysis of (1) timestep sampling strategies and (2)
noise augmentation effects on low-resolution (LR) inputs. These findings
directly inform our architectural and training innovations. Finally, we
introduce an interleaving temporal unit and sparse local attention to achieve
efficient training and inference, drastically reducing computational overhead.
Extensive experiments demonstrate the superiority of our framework over
existing methods, with ablation studies confirming the efficacy of each design
choice. Our work establishes a simple yet effective baseline for cascaded video
super-resolution generation, offering practical insights to guide future
advancements in efficient cascaded synthesis systems.
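
To give a feel for why restricting attention to a local neighborhood reduces cost, here is a minimal sketch of a windowed attention mask over a 1D token sequence. The abstract does not specify how the paper's sparse local attention is constructed, so the function names, the 1D layout, and the `window` parameter are all illustrative assumptions, not the authors' implementation.

```python
from typing import List


def local_attention_mask(num_tokens: int, window: int) -> List[List[bool]]:
    """Build a mask where mask[i][j] is True if token i may attend to token j.

    Hypothetical helper: a token attends only to neighbors at most `window`
    positions away, which is one common form of local attention.
    """
    if num_tokens < 0 or window < 0:
        raise ValueError("num_tokens and window must be non-negative")
    return [
        [abs(i - j) <= window for j in range(num_tokens)]
        for i in range(num_tokens)
    ]


def attended_fraction(mask: List[List[bool]]) -> float:
    """Fraction of query-key pairs kept, relative to full (dense) attention."""
    total = len(mask) ** 2
    kept = sum(sum(row) for row in mask)
    return kept / total if total else 0.0


mask = local_attention_mask(num_tokens=6, window=1)
for row in mask:
    # 'x' marks an allowed query-key pair, '.' a masked-out pair.
    print("".join("x" if allowed else "." for allowed in row))
print(f"fraction of pairs kept: {attended_fraction(mask):.2f}")
```

With a fixed window, the number of kept pairs grows linearly with sequence length rather than quadratically, which is the general reason local attention lowers training and inference cost at high resolution.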