LongLive-2.0: An NVFP4 Parallel Infrastructure for Long Video Generation

Abstract

We present LongLive-2.0, an NVFP4-based parallel infrastructure that spans the full training and inference workflow of long video generation and addresses its speed and memory bottlenecks. For training, we introduce sequence-parallel (SP) autoregressive (AR) training, instantiated as Balanced SP, which co-designs the efficient teacher-forcing layout with SP execution by pairing clean-history and noisy-target temporal chunks on each rank, enabling a natural teacher-forcing mask with SP-aware chunked VAE encoding. Combined with NVFP4 precision, it reduces GPU memory cost and accelerates GEMM computation, which accounts for a growing share of training cost as video length increases. Moreover, we show that a high-quality infrastructure and dataset enable a remarkably clean training pipeline. Unlike existing methods in the Self-Forcing family, which rely on ODE initialization followed by distribution matching distillation (DMD), LongLive-2.0 directly fine-tunes a diffusion model into a long-video, multi-shot, interactive AR diffusion model. It can be further converted to real-time generation (4 to 2 denoising steps) with standalone LoRA weights. For inference on Blackwell GPUs, we enable W4A4 NVFP4 inference, quantize the KV cache to NVFP4 for memory savings, and boost end-to-end throughput with asynchronous streaming VAE decoding. On non-Blackwell GPU architectures, we deploy SP inference to match the speed achieved on Blackwell GPUs, while the quantized KV cache reduces SP's inter-GPU communication. Experiments show up to 2.15x speedup in training and 1.84x in inference. LongLive-2.0-5B achieves 45.7 FPS inference while attaining strong performance on benchmarks. To our knowledge, LongLive-2.0 is the first NVFP4 training and inference system for long video generation.
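To make the pairing of clean-history and noisy-target chunks more concrete, the following is a minimal chunk-level sketch in Python with NumPy. It is an illustration under stated assumptions, not the LongLive-2.0 implementation. It assumes a common teacher-forcing layout in which the sequence is the concatenation of N clean chunks and N noisy chunks, a clean chunk attends causally to clean chunks, and a noisy chunk attends to strictly earlier clean chunks and to itself. The abstract does not specify the exact rank assignment; the function names `teacher_forcing_mask` and `paired_layout` and the contiguous split across ranks are hypothetical.

```python
import numpy as np


def teacher_forcing_mask(num_chunks):
    """Chunk-level attention mask over [clean_0..clean_{N-1}, noisy_0..noisy_{N-1}].

    mask[q, k] is True when query chunk q may attend to key chunk k.
    Assumed rule (not taken from the paper): clean chunks are block-causal,
    and noisy chunk i sees clean chunks j < i plus itself.
    """
    n = num_chunks
    idx = np.arange(n)
    clean_to_clean = idx[:, None] >= idx[None, :]   # clean i -> clean j <= i
    clean_to_noisy = np.zeros((n, n), dtype=bool)   # clean never sees noisy
    noisy_to_clean = idx[:, None] > idx[None, :]    # noisy i -> clean j < i
    noisy_to_noisy = np.eye(n, dtype=bool)          # noisy i -> noisy i only
    top = np.concatenate([clean_to_clean, clean_to_noisy], axis=1)
    bottom = np.concatenate([noisy_to_clean, noisy_to_noisy], axis=1)
    return np.concatenate([top, bottom], axis=0)


def paired_layout(num_chunks, world_size):
    """Hypothetical rank assignment: each rank holds the clean and the noisy
    copy of the same temporal chunks, so every rank stores the same number
    of clean and noisy chunks.

    Returns, per rank, indices into the 2N-chunk sequence used above.
    """
    if num_chunks % world_size != 0:
        raise ValueError("num_chunks must be divisible by world_size")
    per_rank = num_chunks // world_size
    layout = []
    for rank in range(world_size):
        temporal = list(range(rank * per_rank, (rank + 1) * per_rank))
        clean_ids = temporal
        noisy_ids = [num_chunks + t for t in temporal]
        layout.append(clean_ids + noisy_ids)
    return layout


if __name__ == "__main__":
    print(teacher_forcing_mask(4).astype(int))
    print(paired_layout(num_chunks=4, world_size=2))
```

With 4 temporal chunks and 2 ranks, this assignment gives rank 0 the clean and noisy copies of chunks 0 and 1, and rank 1 those of chunks 2 and 3. In a real system the mask would be expanded to token granularity and the attention itself would still need cross-rank communication; neither is modeled here.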