SmoothCache: A Universal Inference Acceleration Technique for Diffusion Transformers

Abstract

Diffusion Transformers (DiT) have emerged as powerful generative models for a range of tasks, including image, video, and speech synthesis. Their inference remains computationally expensive, however, because resource-intensive attention and feed-forward modules are evaluated repeatedly. To address this, we introduce SmoothCache, a model-agnostic inference acceleration technique for DiT architectures. SmoothCache exploits the high similarity observed between layer outputs at adjacent diffusion timesteps. By analyzing layer-wise representation errors on a small calibration set, it adaptively caches and reuses key features during inference. Our experiments show that SmoothCache achieves speedups of 8% to 71% while maintaining or even improving generation quality across diverse modalities. We demonstrate its effectiveness on DiT-XL for image generation, Open-Sora for text-to-video, and Stable Audio Open for text-to-audio, highlighting its potential to enable real-time applications and broaden access to powerful DiT models.
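The caching idea described above can be illustrated with a small, self-contained sketch. It is not the authors' implementation. The error metric (relative L1), the fixed `threshold`, the comparison against the most recently computed step, and the names `relative_l1_error`, `build_cache_schedule`, and `run_with_cache` are all assumptions made for illustration, and layer outputs are represented as flat lists of floats rather than tensors.

```python
from typing import Callable, List, Sequence


def relative_l1_error(current: Sequence[float], reference: Sequence[float]) -> float:
    """Relative L1 distance between two flattened layer outputs (assumed metric)."""
    num = sum(abs(c - r) for c, r in zip(current, reference))
    den = sum(abs(c) for c in current)
    return num / den if den > 0 else float("inf")


def build_cache_schedule(
    calib_outputs: Sequence[Sequence[Sequence[float]]], threshold: float
) -> List[bool]:
    """Decide, per timestep, whether a layer is recomputed (True) or reused (False).

    calib_outputs[s][t] is the layer output for calibration sample s at
    timestep index t. The error is averaged over the calibration samples and
    measured against the most recently recomputed timestep, so that reuse
    over several consecutive steps is accounted for.
    """
    num_steps = len(calib_outputs[0])
    schedule = [True]  # the first timestep always has to be computed
    last_computed = 0
    for t in range(1, num_steps):
        errors = [
            relative_l1_error(sample[t], sample[last_computed])
            for sample in calib_outputs
        ]
        mean_error = sum(errors) / len(errors)
        recompute = mean_error >= threshold
        schedule.append(recompute)
        if recompute:
            last_computed = t
    return schedule


def run_with_cache(
    layer: Callable[[float], float], inputs: Sequence[float], schedule: Sequence[bool]
) -> List[float]:
    """Apply `layer` per timestep, reusing the cached output where the schedule allows."""
    outputs: List[float] = []
    cached = None
    for x, recompute in zip(inputs, schedule):
        if recompute or cached is None:
            cached = layer(x)  # expensive call, e.g. attention or feed-forward
        outputs.append(cached)
    return outputs
```

In this sketch the schedule is built once, offline, from the calibration outputs, and then applied unchanged at inference time. A lower `threshold` recomputes more often and stays closer to the uncached output, while a higher one trades accuracy for fewer layer evaluations.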
