LongLive-2.0: An NVFP4 Parallel Infrastructure for Long Video Generation

Abstract

We present LongLive-2.0, an NVFP4-based parallel infrastructure that addresses speed and memory bottlenecks across the full training and inference workflow of long video generation. For training, we introduce sequence-parallel (SP) autoregressive (AR) training, instantiated as Balanced SP, which co-designs an efficient teacher-forcing layout with SP execution by pairing clean-history and noisy-target temporal chunks on each rank, enabling a natural teacher-forcing mask with SP-aware chunked VAE encoding. Combined with NVFP4 precision, it reduces GPU memory cost and accelerates GEMM computation during training, the proportion of which increases as video length grows. Moreover, we show that a high-quality infrastructure and dataset enable a remarkably clean training pipeline. Unlike existing methods in the Self-Forcing family, which rely on ODE initialization followed by distribution matching distillation (DMD), LongLive-2.0 directly tunes a diffusion model into a long, multi-shot, interactive AR diffusion model. This model can be further converted to real-time generation (from 4 to 2 denoising steps) with standalone LoRA weights. For inference on Blackwell GPUs, we enable W4A4 NVFP4 inference, quantize the KV cache to NVFP4 for memory savings, and boost end-to-end throughput with asynchronous streaming VAE decoding. On non-Blackwell GPU architectures, we deploy SP inference to match the speed achieved on Blackwell GPUs, while the quantized KV cache can reduce the inter-GPU communication of SP. Experiments show up to 2.15x speedup in training and 1.84x in inference. LongLive-2.0-5B achieves 45.7 FPS inference while attaining strong performance on benchmarks. To our knowledge, LongLive-2.0 is the first NVFP4 training and inference system for long video generation.
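
To make the memory argument for an NVFP4 KV cache concrete, the following is a minimal sketch of block-scaled 4-bit quantization in plain Python. It is not the LongLive-2.0 implementation. It assumes the publicly described NVFP4 layout: 4-bit E2M1 values that share one 8-bit scale per block of 16 elements. Two simplifications are made for readability: the block scale is kept as an ordinary Python float rather than rounded to FP8 E4M3, and the format's additional per-tensor scale is omitted. The names `quantize_block`, `fake_quantize`, and `bits_per_element` are hypothetical helpers introduced only for this illustration.

```python
import math

# Magnitudes representable by a 4-bit E2M1 float (1 sign, 2 exponent, 1 mantissa bit).
E2M1_GRID = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
BLOCK = 16  # assumed NVFP4 block size: one shared scale per 16 values


def quantize_block(block):
    """Return (scale, codes) for one block; codes are signed E2M1 grid values."""
    amax = max(abs(v) for v in block)
    # Map the block's largest magnitude onto the largest E2M1 value (6.0).
    # Simplification: the scale stays a Python float, not an FP8 E4M3 number.
    scale = amax / E2M1_GRID[-1] if amax > 0 else 1.0
    codes = []
    for v in block:
        # Round the scaled magnitude to the nearest representable grid point.
        mag = min(E2M1_GRID, key=lambda g: abs(abs(v) / scale - g))
        codes.append(math.copysign(mag, v))
    return scale, codes


def fake_quantize(values):
    """Quantize then dequantize a flat list of floats, block by block."""
    out = []
    for i in range(0, len(values), BLOCK):
        scale, codes = quantize_block(values[i:i + BLOCK])
        out.extend(c * scale for c in codes)
    return out


def bits_per_element(block=BLOCK, value_bits=4, scale_bits=8):
    """Storage cost per element, counting the shared per-block scale."""
    return value_bits + scale_bits / block
```

Under these assumptions each cached element costs 4 + 8/16 = 4.5 bits, compared with 16 bits for a BF16 cache, so the stored tensors shrink by roughly 3.6x. The same reduction applies to any KV tensors exchanged between ranks, which is consistent with the abstract's remark that a quantized KV cache can lower the inter-GPU communication of SP.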