Self-Forcing++: Towards Minute-Scale High-Quality Video Generation

Abstract

Diffusion models have revolutionized image and video generation, achieving unprecedented visual quality. However, their reliance on transformer architectures incurs prohibitively high computational costs, particularly when extending generation to long videos. Recent work has explored autoregressive formulations for long video generation, typically by distilling from short-horizon bidirectional teachers. However, because teacher models cannot synthesize long videos, extrapolating student models beyond their training horizon often leads to pronounced quality degradation, which arises from errors compounding within the continuous latent space. In this paper, we propose a simple yet effective approach to mitigate quality degradation in long-horizon video generation without requiring supervision from long-video teachers or retraining on long video datasets. Our approach centers on exploiting the rich knowledge of teacher models to guide the student model through segments sampled from self-generated long videos. Our method maintains temporal consistency while scaling video length by up to 20x beyond the teacher's capability. It avoids common issues such as over-exposure and error accumulation and, unlike previous methods, does so without recomputing overlapping frames. When scaling up the computation, our method can generate videos of up to 4 minutes and 15 seconds, equivalent to 99.9% of the maximum span supported by our base model's position embedding and more than 50x longer than the videos produced by our baseline model. Experiments on standard benchmarks and our proposed improved benchmark demonstrate that our approach substantially outperforms baseline methods in both fidelity and consistency. Our long-horizon video demos can be found at https://self-forcing-plus-plus.github.io/
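
As a rough illustration of the segment-sampling idea mentioned in the abstract, the sketch below draws a short contiguous window from a longer self-generated rollout. The function name `sample_segment`, the uniform choice of start index, and the sizes used (400 frames, a window of 20) are assumptions made for illustration only; the abstract does not say how segments are selected or how the teacher provides guidance on them.

```python
import random


def sample_segment(rollout, window, rng):
    """Return (start, frames) for a contiguous window taken from a longer rollout.

    Hypothetical helper: the start index is drawn uniformly, which is an
    assumption and not something the abstract specifies.
    """
    num_frames = len(rollout)
    if window > num_frames:
        raise ValueError("window is longer than the rollout")
    # Every valid start position is equally likely.
    start = rng.randrange(num_frames - window + 1)
    return start, rollout[start:start + window]


rng = random.Random(0)
# Frame indices stand in for the latent frames of a student-generated long video.
rollout = list(range(400))
start, segment = sample_segment(rollout, window=20, rng=rng)
print(start, len(segment))
```

In a setup like this, the sampled window would be short enough to fall within the teacher's horizon, so a short-horizon teacher could in principle score it even though the full rollout is far longer than anything the teacher can generate.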
