One-Forcing: Towards Stable One-Step Autoregressive Video Generation

Abstract

Recent advances have substantially improved real-time interactive video generation in the autoregressive regime. However, most existing few-step autoregressive video generation methods, which are often distilled from a corresponding many-step teacher, default to a 4-step sampling configuration. This still incurs considerable latency at deployment, and quality degrades severely when the number of sampling steps is reduced further, particularly in the one-step setting. Trajectory-style consistency distillation methods often produce videos with weak dynamics, while DMD-based approaches such as Self-Forcing tend to yield blurry frames. To address this challenge, we propose One-Forcing, a simple yet effective approach that augments the DMD objective with an auxiliary GAN loss for high-quality and efficient one-step video generation. Experiments on VBench show that One-Forcing achieves a total score of 83.76, establishing state-of-the-art performance among one-step causal video generation methods while remaining competitive with strong many-step approaches. We further demonstrate that one-step framewise autoregressive generation can be achieved stably at merely one-third of the training cost of the chunkwise model, a setting that prior methods have not achieved successfully.