Packing Input Frame Context in Next-Frame Prediction Models for Video Generation
April 17, 2025
Authors: Lvmin Zhang, Maneesh Agrawala
cs.AI
Abstract
We present a neural network structure, FramePack, to train next-frame (or next-frame-section) prediction models for video generation. FramePack compresses input frames so that the transformer context length is a fixed number regardless of the video length. As a result, we are able to process a large number of frames with video diffusion under a computation bottleneck similar to that of image diffusion. This also makes training video batch sizes significantly higher (batch sizes become comparable to image diffusion training). We also propose an anti-drifting sampling method that generates frames in inverted temporal order with early-established endpoints to avoid exposure bias (error accumulation over iterations). Finally, we show that existing video diffusion models can be finetuned with FramePack, and their visual quality may be improved because next-frame prediction supports more balanced diffusion schedulers with less extreme flow-shift timesteps.
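The packing idea described in the abstract can be illustrated with a minimal sketch: frames near the prediction target keep a large token budget, while each step back in time shrinks the budget geometrically, so the total context length stays roughly constant no matter how long the video is. The names below (`pack_frame_context`, `base_tokens`, `decay`) and the use of average pooling as the compression operator are illustrative assumptions; the paper's actual compression mechanism may differ.

```python
import torch
import torch.nn.functional as F

def pack_frame_context(frames: torch.Tensor, base_tokens: int = 1536,
                       decay: float = 2.0) -> torch.Tensor:
    """Pack past frames into a roughly fixed-length token context (sketch).

    `frames` is a (T, C, H, W) tensor of frame latents ordered oldest to
    newest.  The newest frame gets the largest token budget; each step back
    in time divides the budget by `decay`, so the total token count is
    bounded by a geometric series regardless of the video length T.
    """
    packed = []
    budget = float(base_tokens)
    for frame in frames.flip(0):                      # walk from newest to oldest
        c, h, w = frame.shape
        # Pool this frame down to roughly `budget` tokens.
        side = max(1, round(max(budget, 1.0) ** 0.5))
        pooled = F.adaptive_avg_pool2d(frame.unsqueeze(0),
                                       output_size=(min(side, h), min(side, w)))
        packed.append(pooled.flatten(2).transpose(1, 2))  # (1, tokens, C)
        budget /= decay                               # older frames get fewer tokens
    return torch.cat(packed, dim=1)                   # (1, total_tokens, C)

latents = torch.randn(64, 16, 32, 32)                 # 64 past frames of 32x32 latents
context = pack_frame_context(latents)
print(context.shape)                                  # token count stays bounded as T grows
```

With this kind of schedule, doubling the number of past frames adds only a vanishing number of extra tokens, which is what allows the transformer's compute cost to stay close to that of image diffusion.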
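The anti-drifting sampler can likewise be sketched as generating sections from the far end of the clip backwards: the endpoint is established first, and every earlier section is conditioned on both the known first frame and the already-generated later frames, so iteration errors cannot accumulate toward an open end. The `generate_section` interface and the stub model are assumptions made for illustration only; the paper's sampler operates on diffusion latents rather than finished frames.

```python
import torch

class StubSectionModel:
    """Stand-in generator used only to make the sketch runnable."""
    def generate_section(self, anchor_start, anchor_end, length):
        # A real model would denoise `length` frames conditioned on the anchors.
        return torch.zeros(length, 3, 64, 64)

def sample_anti_drifting(model, first_frame, num_sections, section_len):
    """Generate video sections in inverted temporal order (sketch)."""
    sections = [None] * num_sections
    # Establish the endpoint early: the last section is generated first,
    # conditioned only on the user-provided first frame.
    sections[-1] = model.generate_section(
        anchor_start=first_frame, anchor_end=None, length=section_len)
    # Fill the remaining sections from late to early, each conditioned on the
    # first frame and on the later, already-generated section.
    for i in range(num_sections - 2, -1, -1):
        sections[i] = model.generate_section(
            anchor_start=first_frame, anchor_end=sections[i + 1], length=section_len)
    return torch.cat(sections, dim=0)  # frames returned in forward temporal order

first_frame = torch.zeros(3, 64, 64)
video = sample_anti_drifting(StubSectionModel(), first_frame,
                             num_sections=4, section_len=8)
print(video.shape)  # (32, 3, 64, 64)
```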