Taming Teacher Forcing for Masked Autoregressive Video Generation

Abstract

We introduce MAGI, a hybrid video generation framework that combines masked modeling for intra-frame generation with causal modeling for next-frame generation. Our key innovation, Complete Teacher Forcing (CTF), conditions masked frames on complete observation frames rather than on masked ones (the latter scheme being Masked Teacher Forcing, MTF), enabling a smooth transition from token-level (patch-level) to frame-level autoregressive generation. CTF significantly outperforms MTF, achieving a 23% improvement in FVD on first-frame conditioned video prediction. To address issues such as exposure bias, we employ targeted training strategies, setting a new benchmark in autoregressive video generation. Experiments show that MAGI can generate long, coherent video sequences exceeding 100 frames even when trained on as few as 16 frames, highlighting its potential for scalable, high-quality video generation.
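
To make the distinction between CTF and MTF concrete, the following is a minimal sketch of how the conditioning context for a target frame might be assembled under each scheme. It is an illustration under stated assumptions, not MAGI's implementation: frames are represented as plain lists of integer tokens, and the names `MASK`, `mask_frame`, and `build_training_input` are hypothetical. The masking ratio and uniform random sampling are likewise assumed for illustration only.

```python
import random

MASK = None  # hypothetical stand-in for a learned mask token


def mask_frame(frame, mask_ratio, rng):
    """Replace a random subset of a frame's tokens with MASK."""
    n_masked = int(len(frame) * mask_ratio)
    hidden = set(rng.sample(range(len(frame)), n_masked))
    return [MASK if i in hidden else tok for i, tok in enumerate(frame)]


def build_training_input(video, t, mask_ratio, mode, rng):
    """Return (context_frames, masked_target) for predicting frame t.

    mode="CTF": past frames are passed complete (unmasked).
    mode="MTF": past frames are themselves masked.
    In both modes the target frame t is masked and must be reconstructed.
    """
    target = mask_frame(video[t], mask_ratio, rng)
    if mode == "CTF":
        context = [list(f) for f in video[:t]]
    elif mode == "MTF":
        context = [mask_frame(f, mask_ratio, rng) for f in video[:t]]
    else:
        raise ValueError(f"unknown mode: {mode}")
    return context, target


# Toy video: 3 frames of 4 tokens each (values are arbitrary).
rng = random.Random(0)
video = [[10 * f + p for p in range(4)] for f in range(3)]

ctf_context, ctf_target = build_training_input(video, 2, 0.5, "CTF", rng)
mtf_context, mtf_target = build_training_input(video, 2, 0.5, "MTF", rng)
```

In this sketch the only difference between the two modes is whether the past frames reach the model intact. Under CTF, the training-time context resembles what is available at inference, where previously generated frames are complete.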