Autoregressive Adversarial Post-Training for Real-Time Interactive Video Generation
June 11, 2025
Authors: Shanchuan Lin, Ceyuan Yang, Hao He, Jianwen Jiang, Yuxi Ren, Xin Xia, Yang Zhao, Xuefeng Xiao, Lu Jiang
cs.AI
Abstract
Existing large-scale video generation models are computationally intensive,
preventing adoption in real-time and interactive applications. In this work, we
propose autoregressive adversarial post-training (AAPT) to transform a
pre-trained latent video diffusion model into a real-time, interactive video
generator. Our model autoregressively generates one latent frame at a time
using a single neural function evaluation (1NFE). It can stream the result to
the user in real time and receive interactive responses as control signals for
generating the next latent frame. Unlike existing approaches, our method
explores adversarial training as an effective paradigm for autoregressive
generation. This not only allows us to design an architecture that is more
efficient for one-step generation while fully utilizing the KV cache, but also
enables training the model in a student-forcing manner, which proves effective
in reducing error accumulation during long video generation. Our experiments
demonstrate that our 8B model achieves real-time, 24 fps streaming video
generation at 736x416 resolution on a single H100, or at 1280x720 on 8xH100,
for up to a minute (1440 frames). Visit our research website at
https://seaweed-apt.com/2.
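The generation loop the abstract describes can be sketched schematically: each step consumes the latest user control signal, runs one forward pass (1NFE) to emit the next latent frame, and appends to a KV cache so earlier frames are never recomputed. This is a minimal toy sketch, not the authors' implementation; `model_step`, `Frame`, `KVCache`, and the scalar control are hypothetical stand-ins for the 8B generator, its attention cache, and the interactive input.

```python
# Schematic sketch (NOT the paper's code) of autoregressive 1NFE streaming
# generation with an append-only KV cache. All names here are illustrative.

from typing import List, Tuple

Frame = List[float]    # toy stand-in for a latent frame
KVCache = List[Frame]  # toy stand-in for cached attention keys/values

def model_step(prev_frame: Frame, control: float,
               cache: KVCache) -> Tuple[Frame, KVCache]:
    """One neural function evaluation: emit the next latent frame.

    A trivial linear update stands in for the generator network; only the
    new frame is computed and cached, mimicking KV-cache reuse.
    """
    next_frame = [x * 0.9 + control for x in prev_frame]
    return next_frame, cache + [next_frame]  # append-only: no recomputation

def stream_video(first_frame: Frame, controls: List[float]) -> List[Frame]:
    """Autoregressively stream frames, one per incoming control signal."""
    cache: KVCache = [first_frame]
    frame, frames = first_frame, [first_frame]
    for ctrl in controls:  # e.g. user inputs arriving in real time
        frame, cache = model_step(frame, ctrl, cache)
        frames.append(frame)  # in deployment, decoded and shown immediately
    return frames

frames = stream_video([1.0, 2.0], [0.5, 0.0, -0.5])
print(len(frames))  # 4: the initial frame plus one per control signal
```

The key property mirrored here is that per-step cost is independent of video length: each frame requires exactly one model call, and history is reused from the cache rather than reprocessed, which is what makes real-time streaming feasible.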