LongLive: Real-time Interactive Long Video Generation
September 26, 2025
Authors: Shuai Yang, Wei Huang, Ruihang Chu, Yicheng Xiao, Yuyang Zhao, Xianbang Wang, Muyang Li, Enze Xie, Yingcong Chen, Yao Lu, Song Han, Yukang Chen
cs.AI
Abstract
We present LongLive, a frame-level autoregressive (AR) framework for
real-time and interactive long video generation. Long video generation presents
challenges in both efficiency and quality. Diffusion and Diffusion-Forcing
models can produce high-quality videos but suffer from low efficiency due to
bidirectional attention. Causal attention AR models support KV caching for
faster inference, but often degrade in quality on long videos due to memory
challenges during long-video training. In addition, beyond static prompt-based
generation, interactive capabilities, such as streaming prompt inputs, are
critical for dynamic content creation, enabling users to guide narratives in
real time. This interactive requirement significantly increases complexity,
especially in ensuring visual consistency and semantic coherence during prompt
transitions. To address these challenges, LongLive adopts a causal, frame-level
AR design that integrates a KV-recache mechanism that refreshes cached states
with new prompts for smooth, prompt-adherent switches; streaming long tuning to enable
long video training and to align training and inference (train-long-test-long);
and short-window attention paired with a frame-level attention sink (frame sink
for short), which preserves long-range consistency while enabling faster generation.
With these key designs, LongLive fine-tunes a 1.3B-parameter short-clip model
to minute-long generation in just 32 GPU-days. At inference, LongLive sustains
20.7 FPS on a single NVIDIA H100 and achieves strong performance on VBench in both
short and long videos. LongLive supports up to 240-second videos on a single
H100 GPU. LongLive further supports INT8-quantized inference with only marginal
quality loss.
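
To make the KV-recache idea concrete, the following is a minimal, self-contained sketch in PyTorch. `ToyFrameEncoder`, `encode_kv`, and `recache_kv` are illustrative names chosen for this sketch, not LongLive's actual API; it only shows the pattern of rebuilding the KV cache from already-generated frames, conditioned on the newly arrived prompt.

```python
import torch
import torch.nn as nn

class ToyFrameEncoder(nn.Module):
    """Toy stand-in for the AR model's per-frame key/value projection.
    All names and shapes here are illustrative, not LongLive's API."""
    def __init__(self, dim=64):
        super().__init__()
        self.to_k = nn.Linear(dim, dim)
        self.to_v = nn.Linear(dim, dim)
        self.prompt_proj = nn.Linear(dim, dim)

    def encode_kv(self, frame_tokens, prompt_emb):
        # Condition frame tokens on the prompt before projecting to K/V,
        # so the cached states carry the prompt's semantics.
        cond = frame_tokens + self.prompt_proj(prompt_emb)
        return self.to_k(cond), self.to_v(cond)

def recache_kv(model, generated_frames, new_prompt_emb):
    """KV recache at a prompt switch: rebuild the KV cache from frames
    already generated, now conditioned on the new prompt embedding."""
    with torch.no_grad():
        return [model.encode_kv(f, new_prompt_emb) for f in generated_frames]

# Usage: refresh the cache at the prompt boundary, then continue
# autoregressive decoding against the refreshed cache.
model = ToyFrameEncoder()
frames = [torch.randn(16, 64) for _ in range(8)]  # 8 cached frames, 16 tokens each
new_prompt = torch.randn(1, 64)                   # embedding of the new prompt
kv_cache = recache_kv(model, frames, new_prompt)  # 8 refreshed (K, V) pairs
```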
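Similarly, short-window attention with a frame sink can be sketched as plain scaled dot-product attention whose keys and values keep only the first (sink) frames plus a recent window of frames. `frame_sink_attention` and its parameters are hypothetical illustrations of the idea under that assumption, not the paper's implementation.

```python
import torch

def frame_sink_attention(q, k, v, frame_len, window_frames, sink_frames=1):
    """Toy short-window attention with a frame-level attention sink:
    queries attend to the first `sink_frames` frames (the sink) plus the
    most recent `window_frames` frames; middle tokens are dropped.
    Shapes: q is (Tq, d); k and v are (T, d) with T a multiple of frame_len."""
    T, d = k.shape
    sink_end = sink_frames * frame_len
    win_start = max(sink_end, T - window_frames * frame_len)
    # Keep the sink tokens and the recent window; discard everything between.
    k_kept = torch.cat([k[:sink_end], k[win_start:]], dim=0)
    v_kept = torch.cat([v[:sink_end], v[win_start:]], dim=0)
    attn = torch.softmax(q @ k_kept.T / d ** 0.5, dim=-1)
    return attn @ v_kept

# Usage: 60 cached frames of 16 tokens each; attend to 1 sink frame
# plus the last 8 frames while generating the next frame's tokens.
d, frame_len = 64, 16
k = torch.randn(60 * frame_len, d)
v = torch.randn(60 * frame_len, d)
q = torch.randn(frame_len, d)  # queries for the frame being generated
out = frame_sink_attention(q, k, v, frame_len, window_frames=8)
```

The sink keeps a fixed anchor on the earliest frames, which is what lets the window stay short (fast) without the generation drifting away from the video's initial content.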
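For the INT8 claim, a generic way to approximate such inference in PyTorch is dynamic quantization of the linear layers, shown below on the toy module from the first sketch; this is a standard recipe, not necessarily the quantization scheme LongLive itself uses.

```python
import torch

model = ToyFrameEncoder()  # toy module from the KV-recache sketch above
# Quantize Linear weights to INT8 for inference (generic recipe; not
# necessarily LongLive's actual quantization scheme).
qmodel = torch.ao.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)
```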