LongLive-2.0: An NVFP4 Parallel Infrastructure for Long Video Generation

Abstract

We present LongLive-2.0, an NVFP4-based parallel infrastructure that spans the full training and inference workflow of long video generation and addresses its speed and memory bottlenecks. For training, we introduce sequence-parallel (SP) autoregressive (AR) training, instantiated as Balanced SP, which co-designs an efficient teacher-forcing layout with SP execution by pairing clean-history and noisy-target temporal chunks on each rank, enabling a natural teacher-forcing mask with SP-aware chunked VAE encoding. Combined with NVFP4 precision, it reduces GPU memory cost during training and accelerates GEMM computation, whose share grows as video length increases. Moreover, we show that a high-quality infrastructure and dataset enable a remarkably clean training pipeline. Unlike existing methods in the Self-Forcing family, which rely on ODE initialization followed by distribution matching distillation (DMD), LongLive-2.0 directly tunes a diffusion model into a long, multi-shot, interactive AR diffusion model. The model can be further converted to real-time generation (4 to 2 denoising steps) with standalone LoRA weights. For inference on Blackwell GPUs, we enable W4A4 NVFP4 inference, quantize the KV cache to NVFP4 for memory savings, and boost end-to-end throughput with asynchronous streaming VAE decoding. On non-Blackwell GPU architectures, we deploy SP inference to match the speed achieved on Blackwell GPUs, and the quantized KV cache lowers the inter-GPU communication that SP requires. Experiments show speedups of up to 2.15x in training and up to 1.84x in inference. LongLive-2.0-5B achieves 45.7 FPS inference while maintaining strong performance on benchmarks. To our knowledge, LongLive-2.0 is the first NVFP4 training and inference system for long video generation.
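
To make the memory argument for an NVFP4 KV cache concrete, the following is a minimal sketch of block-scaled 4-bit quantization. It is not the LongLive-2.0 implementation. It assumes the NVFP4 layout as publicly described by NVIDIA: E2M1 values that share one 8-bit scale per block of 16 elements. Two simplifications are made here: the block scale is kept in float32 instead of being encoded as FP8 E4M3, and the additional per-tensor scale is omitted. The names `fake_quantize_fp4` and `bits_per_element` are hypothetical helpers introduced for illustration.

```python
import numpy as np

# Magnitudes representable by a 4-bit E2M1 float (1 sign, 2 exponent, 1 mantissa bit).
E2M1_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0], dtype=np.float32)
BLOCK = 16  # number of elements that share one scale factor (assumed NVFP4 block size)


def fake_quantize_fp4(x, block=BLOCK):
    """Quantize a 1-D array to block-scaled FP4 and immediately dequantize it.

    Simplification: the per-block scale stays in float32; real NVFP4 stores it
    as FP8 E4M3 and also applies a per-tensor scale.
    """
    x = np.asarray(x, dtype=np.float32)
    assert x.ndim == 1 and x.size % block == 0
    blocks = x.reshape(-1, block)
    amax = np.abs(blocks).max(axis=1, keepdims=True)
    # Map each block's largest magnitude onto the largest grid value (6.0).
    scale = np.where(amax > 0, amax / E2M1_GRID[-1], 1.0).astype(np.float32)
    scaled = blocks / scale
    # Snap every magnitude to its nearest grid point (ties go to the smaller value).
    idx = np.abs(np.abs(scaled)[..., None] - E2M1_GRID).argmin(axis=-1)
    dequantized = np.sign(scaled) * E2M1_GRID[idx] * scale
    return dequantized.reshape(x.shape)


def bits_per_element(value_bits=4, scale_bits=8, block=BLOCK):
    """Average storage cost per element, including the shared block scale."""
    return value_bits + scale_bits / block


if __name__ == "__main__":
    kv = np.random.default_rng(0).standard_normal(64).astype(np.float32)
    kv_hat = fake_quantize_fp4(kv)
    print("max abs error:", float(np.abs(kv - kv_hat).max()))
    print("bits per element:", bits_per_element())
    print("reduction vs. 16-bit storage:", 16 / bits_per_element())
```

Under these assumptions each cached element costs 4 + 8/16 = 4.5 bits, roughly a 3.6x reduction relative to a 16-bit cache. The same ratio applies to any cache tensors exchanged between GPUs, which is consistent with the claim that a quantized KV cache lowers SP communication.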