Minute-Long Videos with Dual Parallelisms

Abstract

Diffusion Transformer (DiT)-based video diffusion models generate high-quality videos at scale, but they incur prohibitive processing latency and memory costs for long videos. To address this, we propose a novel distributed inference strategy, termed DualParal. The core idea is that, instead of generating an entire video on a single GPU, we parallelize both temporal frames and model layers across GPUs. A naive implementation of this division, however, faces a key limitation: because diffusion models require synchronized noise levels across frames, the two forms of parallelism end up serialized. We handle this with a block-wise denoising scheme, in which a sequence of frame blocks with progressively decreasing noise levels is processed through the pipeline. Each GPU handles a specific block and layer subset while passing its previous results to the next GPU, which enables asynchronous computation and communication. To further optimize performance, we incorporate two key enhancements. First, a feature cache on each GPU stores and reuses features from the prior block as context, minimizing inter-GPU communication and redundant computation. Second, a coordinated noise initialization strategy shares initial noise patterns across GPUs, ensuring globally consistent temporal dynamics without extra resource costs. Together, these enable fast, artifact-free, and infinitely long video generation. Applied to the latest diffusion transformer video generator, our method efficiently produces 1,025-frame videos with up to 6.54× lower latency and 1.48× lower memory cost on 8× RTX 4090 GPUs.
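The block-wise pipelining idea can be illustrated with a small scheduling simulation. This is only a toy sketch under stated assumptions, not the DualParal implementation: the function names `blockwise_tasks` and `pipeline_schedule`, the staggered start of one new block per round, and the one-tick-per-stage cost model are all hypothetical choices made for illustration. Each pipeline stage stands in for one GPU holding a contiguous subset of layers, and a task is one denoising step applied to one frame block.

```python
from typing import Dict, List, Optional, Tuple

Task = Tuple[int, int]  # (block index, denoising step index); hypothetical encoding


def blockwise_tasks(num_blocks: int, num_steps: int) -> List[Task]:
    """List denoising tasks so that earlier blocks are always further denoised.

    Assumption: block b starts one round after block b - 1, so within a round
    the active blocks sit at progressively different noise levels.
    """
    tasks: List[Task] = []
    for rnd in range(num_blocks + num_steps - 1):
        for block in range(num_blocks):
            step = rnd - block
            if 0 <= step < num_steps:
                tasks.append((block, step))
    return tasks


def pipeline_schedule(tasks: List[Task], num_stages: int) -> List[List[Optional[Task]]]:
    """Simulate a layer pipeline; returns grid[tick][stage] -> task or None.

    Assumption: every stage takes exactly one tick per task. A task may enter
    stage 0 only after the previous step of the same block has left the last stage.
    """
    if not tasks:
        return []
    finish: Dict[Task, int] = {}  # first tick at which a task's output is available
    entries: List[int] = []
    next_free = 0
    for block, step in tasks:
        ready = finish.get((block, step - 1), 0)
        tick = max(next_free, ready)  # stall if the dependency is not done yet
        entries.append(tick)
        finish[(block, step)] = tick + num_stages
        next_free = tick + 1
    total_ticks = entries[-1] + num_stages
    grid: List[List[Optional[Task]]] = [[None] * num_stages for _ in range(total_ticks)]
    for task, tick in zip(tasks, entries):
        for stage in range(num_stages):
            grid[tick + stage][stage] = task
    return grid


if __name__ == "__main__":
    schedule = pipeline_schedule(blockwise_tasks(num_blocks=4, num_steps=3), num_stages=2)
    for tick, row in enumerate(schedule):
        print(tick, row)
```

Printing the grid shows different stages working on different blocks in the same tick once the pipeline has filled, which is the overlap that synchronized noise levels across all frames would otherwise prevent. The sketch deliberately omits the feature cache and the noise initialization.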
