Self-Forcing++: Towards Minute-Scale High-Quality Video Generation
October 2, 2025
Authors: Justin Cui, Jie Wu, Ming Li, Tao Yang, Xiaojie Li, Rui Wang, Andrew Bai, Yuanhao Ban, Cho-Jui Hsieh
cs.AI
Abstract
Diffusion models have revolutionized image and video generation, achieving
unprecedented visual quality. However, their reliance on transformer
architectures incurs prohibitively high computational costs, particularly when
extending generation to long videos. Recent work has explored autoregressive
formulations for long video generation, typically by distilling from
short-horizon bidirectional teachers. Nevertheless, given that teacher models
cannot synthesize long videos, the extrapolation of student models beyond their
training horizon often leads to pronounced quality degradation, arising from
the compounding of errors within the continuous latent space. In this paper, we
propose a simple yet effective approach to mitigate quality degradation in
long-horizon video generation without requiring supervision from long-video
teachers or retraining on long video datasets. Our approach centers on
exploiting the rich knowledge of teacher models to provide guidance for the
student model through sampled segments drawn from self-generated long videos.
Our method maintains temporal consistency while scaling video length up to
20x beyond the teacher's capability, avoiding common issues such as over-exposure
and error accumulation without recomputing overlapping frames as previous
methods require. When scaling up computation, our method is capable of
generating videos up to 4 minutes and 15 seconds, equivalent to 99.9% of the
maximum span supported by our base model's position embedding and more than 50x
longer than that of our baseline model. Experiments on standard benchmarks and
our proposed improved benchmark demonstrate that our approach substantially
outperforms baseline methods in both fidelity and consistency. Demos of our
long-horizon videos can be found at https://self-forcing-plus-plus.github.io/.
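
To make the core idea more concrete, below is a minimal, hypothetical sketch of the training loop suggested by the abstract: the student autoregressively rolls out a long video, a window no longer than the teacher's training horizon is sampled from that rollout, and the frozen short-horizon teacher supervises the student on that window. The names `student`, `teacher`, `rollout_long_video`, `add_noise`, and `denoise`, as well as the specific loss, are assumptions for illustration and not the authors' actual API.

```python
import torch
import torch.nn.functional as F

def train_step(student, teacher, prompt, long_len=240, teacher_window=48):
    """One hypothetical distillation step on a self-generated long video.

    Assumes `student`/`teacher` expose rollout_long_video, add_noise, and
    denoise methods; these names are illustrative, not the paper's API.
    """
    # 1) Student generates a long latent video autoregressively; no gradients
    #    flow through the rollout itself.
    with torch.no_grad():
        long_latents = student.rollout_long_video(prompt, num_frames=long_len)

    # 2) Sample a random segment no longer than the teacher's horizon.
    start = torch.randint(0, long_len - teacher_window + 1, (1,)).item()
    window = long_latents[:, start:start + teacher_window]

    # 3) Re-noise the segment and let the student denoise it with gradients.
    t = torch.randint(0, 1000, (window.shape[0],), device=window.device)
    noisy = student.add_noise(window, t)
    student_pred = student.denoise(noisy, t, prompt)

    # 4) The frozen short-horizon teacher scores the same segment; a simple
    #    matching loss (stand-in for the paper's distillation objective)
    #    pulls the student toward the teacher's behavior on this window.
    with torch.no_grad():
        teacher_pred = teacher.denoise(noisy, t, prompt)
    loss = F.mse_loss(student_pred, teacher_pred)
    loss.backward()
    return loss.item()
```

Because the teacher only ever sees windows within its own horizon, this setup avoids needing a long-video teacher or long-video training data, which is the property the abstract emphasizes.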