Autoregressive Adversarial Post-Training for Real-Time Interactive Video Generation
June 11, 2025
Authors: Shanchuan Lin, Ceyuan Yang, Hao He, Jianwen Jiang, Yuxi Ren, Xin Xia, Yang Zhao, Xuefeng Xiao, Lu Jiang
cs.AI
Abstract
Existing large-scale video generation models are computationally intensive,
preventing adoption in real-time and interactive applications. In this work, we
propose autoregressive adversarial post-training (AAPT) to transform a
pre-trained latent video diffusion model into a real-time, interactive video
generator. Our model autoregressively generates a latent frame at a time using
a single neural function evaluation (1NFE). The model can stream the result to
the user in real time and receive interactive responses as controls to generate
the next latent frame. Unlike existing approaches, our method explores
adversarial training as an effective paradigm for autoregressive generation.
This not only allows us to design an architecture that is more efficient for
one-step generation while fully utilizing the KV cache, but also enables
training the model in a student-forcing manner that proves to be effective in
reducing error accumulation during long video generation. Our experiments
demonstrate that our 8B model achieves real-time, 24fps, streaming video
generation at 736x416 resolution on a single H100, or 1280x720 on 8xH100 up to
a minute long (1440 frames). Visit our research website at
https://seaweed-apt.com/2
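The streaming loop the abstract describes (one latent frame per single network evaluation, a KV cache carrying past context, and user controls conditioning each step) can be sketched as below. This is a toy illustration only: the names `one_step_generator`, `kv_cache`, and the dummy computation are assumptions for clarity, not the paper's actual architecture or API.

```python
import numpy as np

LATENT_DIM = 8  # toy latent size for the sketch

def one_step_generator(prev_latent, control, kv_cache):
    """Stand-in for the single-NFE generator: one forward pass
    produces the next latent frame and updates the KV cache.
    (Hypothetical placeholder, not the paper's model.)"""
    new_latent = np.tanh(prev_latent + control)  # dummy computation
    kv_cache.append(new_latent)                  # cached context is reused, not recomputed
    return new_latent, kv_cache

def stream_video(num_frames, get_user_control):
    """Autoregressively emit one latent frame per network call (1 NFE),
    feeding each interactive user control into the next step."""
    kv_cache = []
    latent = np.zeros(LATENT_DIM)
    for t in range(num_frames):
        control = get_user_control(t)            # interactive response from the user
        latent, kv_cache = one_step_generator(latent, control, kv_cache)
        yield latent                             # stream the frame in real time

# Example: generate five latent frames with a simple time-varying control.
frames = list(stream_video(5, lambda t: np.full(LATENT_DIM, 0.1 * t)))
```

The key property mirrored here is that each frame costs exactly one generator call, which is what makes real-time (24 fps) streaming feasible compared with multi-step diffusion sampling.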