Video-Infinity: Distributed Long Video Generation

Abstract

Diffusion models have recently achieved remarkable results in video generation. Despite this encouraging performance, the generated videos are typically limited to a small number of frames, yielding clips that last only a few seconds. The primary obstacles to producing longer videos are the substantial memory requirements and the long processing time on a single GPU. A straightforward solution is to split the workload across multiple GPUs, but this raises two issues: (1) ensuring that all GPUs communicate effectively to share timing and context information, and (2) adapting existing video diffusion models, which are usually trained on short sequences, to generate longer videos without additional training. To address these issues, we introduce Video-Infinity, a distributed inference pipeline that enables parallel processing across multiple GPUs for long-form video generation. Specifically, we propose two coherent mechanisms: Clip parallelism and Dual-scope attention. Clip parallelism optimizes the gathering and sharing of context information across GPUs, which minimizes communication overhead, while Dual-scope attention modulates the temporal self-attention to balance local and global contexts efficiently across devices. Together, the two mechanisms distribute the workload and enable fast generation of long videos. On a setup of 8 × Nvidia 6000 Ada GPUs (48 GB each), our method generates videos of up to 2,300 frames in approximately 5 minutes, which is 100 times faster than prior methods.
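The abstract does not say how frames are assigned to GPUs or how the local and global attention scopes are chosen, so the following is only a rough sketch under stated assumptions, not the authors' implementation. It assumes that each device receives one contiguous clip of frames, and that each frame attends to a small window of neighboring frames (local scope) plus a few frames sampled evenly over the whole video (global scope). The function names and the parameters `local_radius` and `num_global` are hypothetical.

```python
def split_into_clips(num_frames: int, num_devices: int) -> list[range]:
    """Partition frame indices into contiguous clips, one per device (assumed scheme)."""
    base, extra = divmod(num_frames, num_devices)
    clips, start = [], 0
    for rank in range(num_devices):
        # The first `extra` devices take one additional frame each.
        size = base + (1 if rank < extra else 0)
        clips.append(range(start, start + size))
        start += size
    return clips


def dual_scope_keys(frame: int, num_frames: int,
                    local_radius: int, num_global: int) -> list[int]:
    """Frame indices one query frame attends to: local window plus global samples.

    Both scopes are illustrative assumptions; the abstract only states that
    local and global contexts are balanced.
    """
    lo = max(0, frame - local_radius)
    hi = min(num_frames, frame + local_radius + 1)
    local_scope = set(range(lo, hi))

    if num_global <= 0:
        global_scope = set()
    elif num_global == 1:
        global_scope = {0}
    else:
        # Evenly spaced frames from the first to the last frame of the video.
        step = (num_frames - 1) / (num_global - 1)
        global_scope = {round(i * step) for i in range(num_global)}

    return sorted(local_scope | global_scope)


if __name__ == "__main__":
    clips = split_into_clips(2300, 8)
    print([len(c) for c in clips])
    print(dual_scope_keys(10, 100, local_radius=2, num_global=4))
```

With the figures quoted in the abstract (2,300 frames, 8 GPUs), this partition gives each device 287 or 288 frames. In such a scheme only the global samples and the frames near clip boundaries would have to be exchanged between devices, which is one plausible way to keep communication small.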
