LongLive: Real-time Interactive Long Video Generation

Abstract

We present LongLive, a frame-level autoregressive (AR) framework for real-time, interactive long video generation. Long video generation is challenging in both efficiency and quality. Diffusion and Diffusion-Forcing models produce high-quality videos but are inefficient because of bidirectional attention. Causal-attention AR models support KV caching for faster inference, but their quality often degrades on long videos because of memory challenges during long-video training. Beyond static prompt-based generation, interactive capabilities such as streaming prompt inputs are critical for dynamic content creation, letting users guide narratives in real time. This interactive requirement significantly increases complexity, especially in ensuring visual consistency and semantic coherence during prompt transitions. To address these challenges, LongLive adopts a causal, frame-level AR design that integrates three components: a KV-recache mechanism that refreshes cached states with new prompts for smooth, prompt-adherent switches; streaming long tuning, which enables long-video training and aligns training with inference (train-long-test-long); and short window attention paired with a frame-level attention sink (frame sink for short), which preserves long-range consistency while enabling faster generation. With these designs, LongLive fine-tunes a 1.3B-parameter short-clip model to minute-long generation in just 32 GPU-days. At inference, LongLive sustains 20.7 FPS on a single NVIDIA H100 and achieves strong performance on VBench for both short and long videos. LongLive supports videos of up to 240 seconds on a single H100 GPU, and it further supports INT8-quantized inference with only marginal quality loss.
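As an illustration of how short window attention and a frame sink could combine, the sketch below computes which cached frames a newly generated frame may attend to. It is a minimal sketch under stated assumptions, not LongLive's implementation: the function name `visible_frames`, the parameters `window` and `sink`, and the choice to count the current frame inside the window are all hypothetical, and the abstract does not give the actual window or sink sizes.

```python
def visible_frames(current: int, window: int, sink: int) -> list[int]:
    """Indices of frames that frame `current` may attend to.

    Assumptions (not taken from the abstract): the first `sink` frames
    stay permanently visible, and the local window covers the `window`
    most recent frames, including `current` itself.
    """
    if current < 0 or window < 1 or sink < 0:
        raise ValueError("require current >= 0, window >= 1, sink >= 0")
    # Sink frames: the earliest frames, never evicted from the cache.
    sink_part = range(0, min(sink, current + 1))
    # Local window: the most recent frames, ending at the current one.
    window_start = max(current - window + 1, 0)
    window_part = range(window_start, current + 1)
    # Merge both sets, dropping any overlap, in temporal order.
    return sorted(set(sink_part) | set(window_part))


if __name__ == "__main__":
    for t in (0, 1, 10):
        print(t, visible_frames(t, window=3, sink=2))
```

Under these assumptions the number of attended frames never exceeds `sink + window`, so the per-frame attention cost stays bounded as the video grows, while the sink frames keep a fixed reference to the start of the video.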
