LongLive: Real-time Interactive Long Video Generation
September 26, 2025
Authors: Shuai Yang, Wei Huang, Ruihang Chu, Yicheng Xiao, Yuyang Zhao, Xianbang Wang, Muyang Li, Enze Xie, Yingcong Chen, Yao Lu, Song Han, Yukang Chen
cs.AI
Abstract
We present LongLive, a frame-level autoregressive (AR) framework for
real-time and interactive long video generation. Long video generation presents
challenges in both efficiency and quality. Diffusion and Diffusion-Forcing
models can produce high-quality videos but suffer from low efficiency due to
bidirectional attention. Causal attention AR models support KV caching for
faster inference, but often degrade in quality on long videos due to memory
challenges during long-video training. In addition, beyond static prompt-based
generation, interactive capabilities, such as streaming prompt inputs, are
critical for dynamic content creation, enabling users to guide narratives in
real time. This interactive requirement significantly increases complexity,
especially in ensuring visual consistency and semantic coherence during prompt
transitions. To address these challenges, LongLive adopts a causal, frame-level
AR design that integrates a KV-recache mechanism that refreshes cached states
with new prompts for smooth, adherent switches; streaming long tuning to enable
long video training and to align training and inference (train-long-test-long);
and short window attention paired with a frame-level attention sink, shorten as
frame sink, preserving long-range consistency while enabling faster generation.
With these key designs, LongLive fine-tunes a 1.3B-parameter short-clip model
to minute-long generation in just 32 GPU-days. At inference, LongLive sustains
20.7 FPS on a single NVIDIA H100 and achieves strong performance on VBench for
both short and long videos. LongLive supports videos up to 240 seconds on a
single H100 GPU, and further supports INT8-quantized inference with only
marginal quality loss.
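
The KV-recache idea can be pictured as replaying the already-generated frames under the new prompt to rebuild the cache, so upcoming frames follow the new prompt while staying visually consistent with existing content. Below is a minimal sketch under that reading; `encode_prompt` and `forward_frame` are hypothetical method names for illustration, not the paper's released API.

```python
# Minimal sketch of a KV-recache step on a prompt switch, assuming a causal
# frame-level AR generator with hypothetical `encode_prompt` / `forward_frame`
# methods (illustrative names, not the released LongLive API).
def recache_on_prompt_switch(model, generated_frames, new_prompt):
    """Rebuild the KV cache by replaying already-generated frames under the
    new prompt, so subsequent frames adhere to the new prompt while remaining
    visually consistent with what was already generated."""
    new_cond = model.encode_prompt(new_prompt)  # conditioning for the new prompt
    kv_cache = None                             # discard the stale cache
    for frame in generated_frames:              # replay history with new conditioning
        _, kv_cache = model.forward_frame(frame, cond=new_cond, kv_cache=kv_cache)
    return kv_cache                             # resume generation from the refreshed cache
```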
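The short-window attention with a frame sink can be illustrated by the frame-level attention mask it induces: every frame attends to a few always-visible sink frames at the start of the sequence plus a local window of recent frames, rather than the full history. A minimal PyTorch sketch follows; `window` and `sink_frames` are illustrative parameters, not values from the paper.

```python
import torch

def frame_sink_mask(num_frames, window, sink_frames=1):
    """Boolean causal attention mask over frames: each frame attends to the
    first `sink_frames` frames (the frame sink) plus the most recent `window`
    frames, instead of the entire history."""
    idx = torch.arange(num_frames)
    q, k = idx[:, None], idx[None, :]
    causal = k <= q              # no attention to future frames
    local = (q - k) < window     # short attention window over recent frames
    sink = k < sink_frames       # sink frames stay visible to every query
    return causal & (local | sink)
```

For example, `frame_sink_mask(6, window=2, sink_frames=1)` lets frame 5 attend only to frames {0, 4, 5}: the sink frame anchors long-range consistency while the short window keeps per-frame attention cost constant.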