Next Block Prediction: Video Generation via Semi-Autoregressive Modeling
February 11, 2025
Authors: Shuhuai Ren, Shuming Ma, Xu Sun, Furu Wei
cs.AI
Abstract
Next-Token Prediction (NTP) is a de facto approach for autoregressive (AR)
video generation, but it suffers from suboptimal unidirectional dependencies
and slow inference speed. In this work, we propose a semi-autoregressive
(semi-AR) framework, called Next-Block Prediction (NBP), for video generation.
By uniformly decomposing video content into equal-sized blocks (e.g., rows or
frames), we shift the generation unit from individual tokens to blocks,
allowing each token in the current block to simultaneously predict the
corresponding token in the next block. Unlike traditional AR modeling, our
framework employs bidirectional attention within each block, enabling tokens to
capture more robust spatial dependencies. By predicting multiple tokens in
parallel, NBP models significantly reduce the number of generation steps,
leading to faster and more efficient inference. Our model achieves FVD scores
of 103.3 on UCF101 and 25.5 on K600, outperforming the vanilla NTP model by an
average of 4.4. Furthermore, thanks to the reduced number of inference steps,
the NBP model generates 8.89 frames (128x128 resolution) per second, achieving
an 11x speedup. We also explored model scales ranging from 700M to 3B
parameters, observing significant improvements in generation quality, with FVD
scores dropping from 103.3 to 55.3 on UCF101 and from 25.5 to 19.5 on K600,
demonstrating the scalability of our approach.
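As an illustration of the dependency structure the abstract describes, the sketch below builds a block-wise attention mask: tokens attend bidirectionally to all tokens within their own block, but only causally to tokens in earlier blocks. This is a minimal interpretation of the semi-AR scheme based solely on the abstract; the function name and NumPy implementation are our own assumptions, not the authors' code.

```python
import numpy as np

def block_causal_mask(num_blocks: int, block_size: int) -> np.ndarray:
    """Attention mask for semi-autoregressive (block-wise) generation.

    mask[i, j] == True means query token i may attend to key token j.
    Attention is bidirectional within a block (tokens in the same block
    see each other) and causal across blocks (no block sees a later one).
    """
    n = num_blocks * block_size
    block_id = np.arange(n) // block_size  # block index of each token
    # Allowed iff the key's block is not later than the query's block.
    return block_id[:, None] >= block_id[None, :]

# Example: 3 blocks of 2 tokens each (e.g., 3 rows of a frame).
mask = block_causal_mask(num_blocks=3, block_size=2)
```

With a vanilla NTP (strictly causal) mask, `mask[0, 1]` would be `False`; here it is `True`, reflecting the bidirectional attention within each block, while future blocks remain masked out. Generating one block per step rather than one token per step is what reduces the number of inference steps.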