Self-Forcing++: Towards Minute-Scale High-Quality Video Generation
October 2, 2025
Authors: Justin Cui, Jie Wu, Ming Li, Tao Yang, Xiaojie Li, Rui Wang, Andrew Bai, Yuanhao Ban, Cho-Jui Hsieh
cs.AI
Abstract
Diffusion models have revolutionized image and video generation, achieving unprecedented visual quality. However, their reliance on transformer architectures incurs prohibitively high computational costs, particularly when extending generation to long videos. Recent work has explored autoregressive formulations for long video generation, typically by distilling from short-horizon bidirectional teachers. Nevertheless, because teacher models cannot synthesize long videos, student models that extrapolate beyond their training horizon often suffer pronounced quality degradation caused by compounding errors in the continuous latent space. In this paper, we propose a simple yet effective approach to mitigate quality degradation in long-horizon video generation without requiring supervision from long-video teachers or retraining on long-video datasets. Our approach exploits the rich knowledge of teacher models to guide the student through segments sampled from self-generated long videos. The method maintains temporal consistency while scaling video length to up to 20x the teacher's horizon, avoiding common failure modes such as over-exposure and error accumulation, and it does not recompute overlapping frames as previous methods do. With increased computation, our method generates videos of up to 4 minutes and 15 seconds, equivalent to 99.9% of the maximum span supported by our base model's position embedding and more than 50x longer than our baseline model. Experiments on standard benchmarks and on our proposed improved benchmark demonstrate that our approach substantially outperforms baseline methods in both fidelity and consistency. Demos of our long-horizon videos are available at https://self-forcing-plus-plus.github.io/
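
To make the core idea concrete, below is a minimal sketch of the kind of training step the abstract describes: the student autoregressively rolls out a long latent video on its own, a short window is sampled from that rollout, and the frozen short-horizon teacher supervises only that window. All interfaces here (`student.rollout`, `student.add_noise`, `student.denoise`, `teacher_score`) and the simple MSE distillation objective are illustrative assumptions for the sketch, not the paper's actual implementation.

```python
import torch
import torch.nn.functional as F


def segment_distillation_step(student, teacher_score, optimizer,
                              prompt_emb, long_len=240, window=16):
    """One hypothetical training step: distill teacher guidance onto a short
    segment sampled from the student's self-generated long rollout."""
    # 1) Self-generate a long latent rollout with the student.
    #    The rollout itself is treated as data, so no gradients are kept here.
    with torch.no_grad():
        long_latents = student.rollout(prompt_emb, num_frames=long_len)

    # 2) Sample a random window short enough for the teacher's horizon.
    start = torch.randint(0, long_len - window + 1, (1,)).item()
    segment = long_latents[:, start:start + window]

    # 3) Perturb the segment and let the student re-denoise it with gradients.
    t = torch.rand(segment.shape[0], device=segment.device)
    noisy = student.add_noise(segment, t)
    student_pred = student.denoise(noisy, t, prompt_emb)

    # 4) The frozen bidirectional teacher provides the target on the same
    #    window; a simple MSE stands in for the distillation objective.
    with torch.no_grad():
        teacher_pred = teacher_score(noisy, t, prompt_emb)
    loss = F.mse_loss(student_pred, teacher_pred)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Because the teacher only ever sees windows no longer than its own training horizon, its guidance stays in-distribution even though the student's rollout far exceeds that horizon, which is the intuition behind extending generation length without a long-video teacher or long-video training data.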