DDiT: Dynamic Patch Scheduling for Efficient Diffusion Transformers

Abstract

Diffusion Transformers (DiTs) have achieved state-of-the-art performance in image and video generation, but their success comes at the cost of heavy computation. This inefficiency is largely due to the fixed tokenization process, which uses constant-sized patches throughout the entire denoising phase, regardless of the content's complexity. We propose dynamic tokenization, an efficient test-time strategy that varies patch sizes based on content complexity and the denoising timestep. Our key insight is that early timesteps require only coarse patches to model global structure, while later iterations demand finer (smaller-sized) patches to refine local details. During inference, our method dynamically reallocates patch sizes across denoising steps for image and video generation, substantially reducing computational cost while preserving perceptual generation quality. Extensive experiments demonstrate the effectiveness of our approach: it achieves speedups of up to 3.52× on FLUX-1.Dev and 3.2× on Wan 2.1 without compromising generation quality or prompt adherence.
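
To make the coarse-to-fine idea concrete, the following is a minimal sketch, not the authors' implementation. It assumes a purely timestep-based schedule (the method described above also adapts to content complexity, which is ignored here), and all function names, patch sizes, and the latent resolution are hypothetical. It compares only the number of tokens processed across denoising steps against a constant-patch baseline.

```python
from __future__ import annotations


def num_tokens(height: int, width: int, patch: int) -> int:
    """Tokens produced when an H x W latent is split into patch x patch tiles."""
    if height % patch or width % patch:
        raise ValueError("latent size must be divisible by the patch size")
    return (height // patch) * (width // patch)


def patch_schedule(num_steps: int, coarse_fraction: float = 0.5,
                   coarse_patch: int = 4, fine_patch: int = 2) -> list[int]:
    """Hypothetical fixed schedule (an assumption for illustration):
    coarse patches for the first `coarse_fraction` of the steps,
    fine patches for the remaining steps."""
    n_coarse = int(num_steps * coarse_fraction)
    return [coarse_patch] * n_coarse + [fine_patch] * (num_steps - n_coarse)


def relative_token_cost(height: int, width: int, schedule: list[int],
                        baseline_patch: int = 2) -> float:
    """Total tokens processed under `schedule`, divided by the total for a
    baseline that uses `baseline_patch` at every step."""
    dynamic = sum(num_tokens(height, width, p) for p in schedule)
    baseline = len(schedule) * num_tokens(height, width, baseline_patch)
    return dynamic / baseline


if __name__ == "__main__":
    schedule = patch_schedule(num_steps=50)
    print(relative_token_cost(64, 64, schedule))
```

Token count is only a rough proxy for cost: self-attention scales roughly quadratically with the number of tokens, so the actual compute savings from coarse steps would differ from this ratio, and the measured speedups reported above depend on the specific model and schedule.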
