Taming Teacher Forcing for Masked Autoregressive Video Generation
January 21, 2025
Authors: Deyu Zhou, Quan Sun, Yuang Peng, Kun Yan, Runpei Dong, Duomin Wang, Zheng Ge, Nan Duan, Xiangyu Zhang, Lionel M. Ni, Heung-Yeung Shum
cs.AI
Abstract
We introduce MAGI, a hybrid video generation framework that combines masked
modeling for intra-frame generation with causal modeling for next-frame
generation. Our key innovation, Complete Teacher Forcing (CTF), conditions
masked frames on complete observation frames rather than on masked ones (the
latter being Masked Teacher Forcing, MTF), enabling a smooth transition from
token-level (patch-level) to frame-level autoregressive generation. CTF significantly
outperforms MTF, achieving a +23% improvement in FVD scores on first-frame
conditioned video prediction. To address issues like exposure bias, we employ
targeted training strategies, setting a new benchmark in autoregressive video
generation. Experiments show that MAGI can generate long, coherent video
sequences exceeding 100 frames, even when trained on as few as 16 frames,
highlighting its potential for scalable, high-quality video generation.
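
The difference between CTF and MTF described above can be illustrated with a small sketch. This is not the authors' implementation; it only assumes that each frame is a list of discrete tokens, that masking replaces a random subset of tokens with a placeholder, and that each training example pairs a masked query frame with its preceding frames as context. The names `MASK`, `mask_frame`, and `build_examples` are hypothetical.

```python
import random

MASK = None  # hypothetical placeholder standing in for a masked token


def mask_frame(frame, mask_ratio, rng):
    """Return a copy of `frame` with a random subset of tokens replaced by MASK."""
    num_masked = round(mask_ratio * len(frame))
    masked_positions = set(rng.sample(range(len(frame)), num_masked))
    return [MASK if i in masked_positions else tok for i, tok in enumerate(frame)]


def build_examples(frames, mask_ratio, mode, seed=0):
    """Build (context, query, target) examples for frames 1..T-1.

    mode="CTF": the context holds the complete (unmasked) earlier frames.
    mode="MTF": the context holds the masked versions of the earlier frames.
    Frame 0 is only ever used as context, mirroring first-frame conditioning.
    """
    if mode not in ("CTF", "MTF"):
        raise ValueError("mode must be 'CTF' or 'MTF'")
    rng = random.Random(seed)
    masked = [mask_frame(f, mask_ratio, rng) for f in frames]
    context_source = frames if mode == "CTF" else masked
    examples = []
    for t in range(1, len(frames)):
        examples.append({
            "context": context_source[:t],  # earlier frames the model may attend to
            "query": masked[t],             # current frame with tokens hidden
            "target": frames[t],            # ground-truth tokens to reconstruct
        })
    return examples


frames = [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11]]
ctf = build_examples(frames, mask_ratio=0.5, mode="CTF")
mtf = build_examples(frames, mask_ratio=0.5, mode="MTF")
```

In this toy setup the two modes share the same masked query frames and differ only in the context: under CTF the context matches what is available at inference time, when earlier frames have already been fully generated, whereas under MTF the context itself contains masked tokens.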