LongLive: Real-time Interactive Long Video Generation

Abstract

We present LongLive, a frame-level autoregressive (AR) framework for real-time and interactive long video generation. Long video generation presents challenges in both efficiency and quality. Diffusion and Diffusion-Forcing models can produce high-quality videos but are inefficient because they rely on bidirectional attention. AR models with causal attention support KV caching for faster inference, but their quality often degrades on long videos because of memory challenges during long-video training. In addition, beyond static prompt-based generation, interactive capabilities such as streaming prompt inputs are critical for dynamic content creation, since they let users guide narratives in real time. This interactive requirement significantly increases complexity, especially in ensuring visual consistency and semantic coherence during prompt transitions. To address these challenges, LongLive adopts a causal, frame-level AR design that integrates three components: a KV-recache mechanism, which refreshes cached states with new prompts for smooth, prompt-adherent switches; streaming long tuning, which enables long-video training and aligns training with inference (train-long-test-long); and short-window attention paired with a frame-level attention sink (abbreviated as frame sink), which preserves long-range consistency while enabling faster generation. With these key designs, LongLive fine-tunes a 1.3B-parameter short-clip model for minute-long generation in just 32 GPU-days. At inference, LongLive sustains 20.7 FPS on a single NVIDIA H100 and achieves strong performance on VBench for both short and long videos. LongLive supports videos of up to 240 seconds on a single H100 GPU. LongLive further supports INT8-quantized inference with only marginal quality loss.
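To make the idea of short-window attention with a frame sink more concrete, the following is a minimal sketch of a frame-level attention mask. It is an illustration under stated assumptions, not the authors' implementation: the function name `frame_sink_window_mask` and the parameters `window` and `num_sink` are hypothetical, and the abstract does not specify the actual window or sink sizes. The sketch assumes each frame may attend to itself, to the most recent `window` frames, and to the first `num_sink` frames of the video.

```python
def frame_sink_window_mask(num_frames, window, num_sink):
    """Build a frame-level boolean mask (hypothetical helper, for illustration).

    mask[q][k] is True when query frame q may attend to key frame k.
    Assumed rule: causal attention, restricted to the last `window`
    frames (including q itself) plus the first `num_sink` sink frames.
    """
    mask = []
    for q in range(num_frames):
        row = []
        for k in range(num_frames):
            causal = k <= q              # never attend to future frames
            in_window = q - k < window   # recent frames inside the short window
            is_sink = k < num_sink       # sink frames stay visible throughout
            row.append(causal and (in_window or is_sink))
        mask.append(row)
    return mask


# Illustrative sizes only; these values are not taken from the paper.
mask = frame_sink_window_mask(num_frames=8, window=3, num_sink=1)
for row in mask:
    print("".join("1" if allowed else "0" for allowed in row))
```

Under these assumptions, each frame attends to at most `window + num_sink` frames regardless of video length. This bounds the per-frame attention cost and cache size, while the sink frames keep early content reachable.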
