One-Forcing: Towards Stable One-Step Autoregressive Video Generation

Abstract

Recent advances have substantially improved real-time interactive video generation in the autoregressive regime. However, most existing few-step autoregressive video generation methods, often distilled from a corresponding many-step teacher, default to a 4-step sampling configuration. This configuration still incurs considerable latency at deployment, and quality degrades severely when the number of sampling steps is reduced further, particularly in the one-step setting. Trajectory-style consistency distillation methods often produce videos with weak dynamics, while DMD-based approaches such as Self-Forcing tend to yield blurry frames. To address this challenge, we propose One-Forcing, a simple yet effective approach that augments the DMD objective with an auxiliary GAN loss for high-quality, efficient one-step video generation. Experiments on VBench show that One-Forcing achieves a total score of 83.76, establishing state-of-the-art performance among one-step causal video generation methods while remaining competitive with strong many-step approaches. We further demonstrate that one-step framewise autoregressive generation can be achieved stably at only one-third of the training cost of the chunkwise model, a setting in which prior methods have not succeeded.
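
As a rough illustration of what augmenting the DMD objective with an auxiliary GAN loss could look like on the generator side, the sketch below adds a weighted non-saturating GAN term to an already computed DMD loss value. The abstract does not specify the GAN formulation or its weighting, so the function names, the choice of the non-saturating loss, and the default `gan_weight=0.1` are all assumptions made for illustration only.

```python
import math
from typing import Sequence


def softplus(z: float) -> float:
    """Numerically stable log(1 + exp(z))."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def generator_gan_loss(fake_logits: Sequence[float]) -> float:
    """Non-saturating generator loss: mean of -log(sigmoid(logit)).

    Assumption: a discriminator emits one real/fake logit per generated
    sample; higher logits mean "looks more real".
    """
    # -log(sigmoid(l)) equals softplus(-l)
    return sum(softplus(-logit) for logit in fake_logits) / len(fake_logits)


def combined_generator_loss(
    dmd_loss: float,
    fake_logits: Sequence[float],
    gan_weight: float = 0.1,  # hypothetical weight, not taken from the paper
) -> float:
    """DMD loss plus a weighted auxiliary GAN term (illustrative only)."""
    return dmd_loss + gan_weight * generator_gan_loss(fake_logits)
```

In this sketch the GAN term shrinks as the discriminator rates generated samples as more realistic, so it would act as an extra signal on top of the distribution-matching term. How the two terms are actually balanced in One-Forcing is not stated in the abstract.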