Taming Teacher Forcing for Masked Autoregressive Video Generation
January 21, 2025
Authors: Deyu Zhou, Quan Sun, Yuang Peng, Kun Yan, Runpei Dong, Duomin Wang, Zheng Ge, Nan Duan, Xiangyu Zhang, Lionel M. Ni, Heung-Yeung Shum
cs.AI
Abstract
We introduce MAGI, a hybrid video generation framework that combines masked
modeling for intra-frame generation with causal modeling for next-frame
generation. Our key innovation, Complete Teacher Forcing (CTF), conditions
masked frames on complete observation frames rather than masked ones (namely
Masked Teacher Forcing, MTF), enabling a smooth transition from token-level
(patch-level) to frame-level autoregressive generation. CTF significantly
outperforms MTF, achieving a +23% improvement in FVD scores on first-frame
conditioned video prediction. To address issues like exposure bias, we employ
targeted training strategies, setting a new benchmark in autoregressive video
generation. Experiments show that MAGI can generate long, coherent video
sequences exceeding 100 frames, even when trained on as few as 16 frames,
highlighting its potential for scalable, high-quality video generation.
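
The difference between CTF and MTF described in the abstract can be illustrated with a toy sketch. This is only an illustration of the conditioning choice, not the authors' implementation: the helper names (`mask_frame`, `conditioning_context`), the `MASK` placeholder, and the list-of-integers token representation are all assumptions made for the example.

```python
MASK = None  # hypothetical placeholder standing in for a learned mask token


def mask_frame(frame, masked_positions):
    """Return a copy of `frame` with the given token positions replaced by MASK."""
    return [MASK if i in masked_positions else tok for i, tok in enumerate(frame)]


def conditioning_context(frames, masks, t, mode):
    """Earlier frames that frame t is conditioned on during training.

    mode="ctf": earlier frames are passed in complete (fully observed).
    mode="mtf": earlier frames are passed in with their own masks applied.
    """
    if mode == "ctf":
        return [list(f) for f in frames[:t]]
    if mode == "mtf":
        return [mask_frame(f, m) for f, m in zip(frames[:t], masks[:t])]
    raise ValueError(f"unknown mode: {mode}")


# Toy video: 3 frames of 4 tokens each, with a different mask per frame.
frames = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]]
masks = [{0, 2}, {1}, {0, 1, 3}]

ctf_context = conditioning_context(frames, masks, t=2, mode="ctf")
mtf_context = conditioning_context(frames, masks, t=2, mode="mtf")
current_input = mask_frame(frames[2], masks[2])  # masked input is the same in both modes
```

In both modes the frame being predicted is masked in the same way. Only the earlier frames it is conditioned on differ: complete under CTF, masked under MTF.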