Training-Free Efficient Video Generation via Dynamic Token Carving

Abstract

Despite the remarkable generation quality of video Diffusion Transformer (DiT) models, their practical deployment is severely hindered by extensive computational requirements. This inefficiency stems from two key challenges: the quadratic complexity of self-attention with respect to token length and the multi-step nature of diffusion models. To address these limitations, we present Jenga, a novel inference pipeline that combines dynamic attention carving with progressive resolution generation. Our approach leverages two key insights: (1) early denoising steps do not require high-resolution latents, and (2) later steps do not require dense attention. Jenga introduces a block-wise attention mechanism that dynamically selects relevant token interactions using 3D space-filling curves, alongside a progressive resolution strategy that gradually increases latent resolution during generation. Experimental results demonstrate that Jenga achieves substantial speedups across multiple state-of-the-art video diffusion models while maintaining comparable generation quality (8.83× speedup with a 0.01% performance drop on VBench). As a plug-and-play solution, Jenga enables practical, high-quality video generation on modern hardware by reducing inference time from minutes to seconds, without requiring model retraining. Code: https://github.com/dvlab-research/Jenga
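The quadratic cost of dense self-attention mentioned in the abstract can be made concrete with a little arithmetic. The sketch below is not taken from the Jenga code base. The helper `attention_pairs` and the latent grid sizes are hypothetical and chosen only for illustration. It counts query-key pairs for a dense attention layer and shows why running early denoising steps on a lower-resolution latent is attractive: halving the latent height and width cuts the token count by 4 and the pair count by 16.

```python
def attention_pairs(frames: int, height: int, width: int) -> int:
    """Count query-key pairs for dense self-attention over a latent grid.

    Each latent position is treated as one token, so the pair count
    is the square of the token count.
    """
    tokens = frames * height * width
    return tokens * tokens


# Hypothetical latent grid sizes (assumptions, not values from the paper).
full_res = attention_pairs(frames=16, height=48, width=80)
half_res = attention_pairs(frames=16, height=24, width=40)

# Halving height and width gives 4x fewer tokens, hence 16x fewer pairs.
ratio = full_res // half_res
print(f"full: {full_res:,} pairs, half: {half_res:,} pairs, ratio: {ratio}x")
```

This only models the dense case. Sparse block-wise selection, as described in the abstract, would reduce the pair count further by skipping token blocks judged irrelevant.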

