Adaptive Caching for Faster Video Generation with Diffusion Transformers

Abstract

Generating temporally consistent, high-fidelity videos can be computationally expensive, especially over longer temporal spans. More recent Diffusion Transformers (DiTs), despite making significant headway in this context, have only heightened such challenges because they rely on larger models and heavier attention mechanisms, resulting in slower inference. In this paper, we introduce a training-free method to accelerate video DiTs, termed Adaptive Caching (AdaCache), motivated by the observation that "not all videos are created equal": some videos require fewer denoising steps than others to attain a reasonable quality. Building on this, we not only cache computations through the diffusion process, but also devise a caching schedule tailored to each video generation, maximizing the quality-latency trade-off. We further introduce a Motion Regularization (MoReg) scheme to utilize video information within AdaCache, essentially controlling the compute allocation based on motion content. Altogether, our plug-and-play contributions grant significant inference speedups (e.g., up to 4.7x on Open-Sora 720p - 2s video generation) without sacrificing generation quality, across multiple video DiT baselines.
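To make the idea of a per-video caching schedule concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the method's actual implementation: the abstract does not specify the distance metric, the thresholds, or how motion enters the schedule. The mean absolute difference, the `CODEBOOK` values, the `1 + motion` scaling, and every function name below are hypothetical choices made only for this example.

```python
from typing import Callable, List, Sequence

Vector = Sequence[float]

# Hypothetical codebook: (distance upper bound, steps to reuse the cache).
# A small change between computed residuals means the cache is reused longer.
CODEBOOK = [(0.05, 4), (0.10, 3), (0.20, 2)]


def l1_distance(a: Vector, b: Vector) -> float:
    """Mean absolute difference between two equally sized vectors."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def motion_score(frames: Sequence[Vector]) -> float:
    """Assumed motion proxy: mean difference between consecutive latent frames."""
    if len(frames) < 2:
        return 0.0
    diffs = [l1_distance(f0, f1) for f0, f1 in zip(frames, frames[1:])]
    return sum(diffs) / len(diffs)


def caching_rate(distance: float, motion: float) -> int:
    """Map a motion-scaled distance to the number of steps until recompute."""
    scaled = distance * (1.0 + motion)  # assumption: more motion, less reuse
    for upper, rate in CODEBOOK:
        if scaled < upper:
            return rate
    return 1  # large change: recompute at the very next step


def run_schedule(
    num_steps: int,
    compute_residual: Callable[[int], Vector],
    latent_frames_at: Callable[[int], Sequence[Vector]],
) -> List[int]:
    """Return the denoising steps at which the residual was actually computed."""
    computed_steps: List[int] = []
    cached = None
    next_compute = 0
    for step in range(num_steps):
        if step < next_compute:
            continue  # a real model would reuse `cached` here
        residual = compute_residual(step)
        computed_steps.append(step)
        if cached is None:
            rate = 1  # nothing to compare against yet
        else:
            dist = l1_distance(residual, cached)
            rate = caching_rate(dist, motion_score(latent_frames_at(step)))
        cached = residual
        next_compute = step + rate
    return computed_steps
```

In this sketch, a generation whose residuals barely change between steps skips most of the expensive computation, while one whose residuals keep changing, or whose latent frames show more motion, recomputes more often. That is the sense in which the schedule adapts to each video rather than being fixed in advance.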
