RelaCtrl: Relevance-Guided Efficient Control for Diffusion Transformers

Abstract

The Diffusion Transformer plays a pivotal role in advancing text-to-image and text-to-video generation, owing primarily to its inherent scalability. However, existing controllable diffusion transformer methods incur significant parameter and computational overhead and allocate resources inefficiently, because they do not account for how the relevance of control information varies across transformer layers. To address this, we propose the Relevance-Guided Efficient Controllable Generation framework, RelaCtrl, which integrates control signals into the Diffusion Transformer in an efficient, resource-optimized way. First, we evaluate the relevance of each layer in the Diffusion Transformer to the control information using the "ControlNet Relevance Score", i.e., the impact that skipping each control layer during inference has on both generation quality and control effectiveness. Based on the strength of this relevance, we then tailor the positioning, parameter scale, and modeling capacity of the control layers to reduce unnecessary parameters and redundant computation. Additionally, to further improve efficiency, we replace the self-attention and feed-forward network (FFN) in the commonly used copy block with the carefully designed Two-Dimensional Shuffle Mixer (TDSM), which efficiently implements both the token mixer and the channel mixer. Both qualitative and quantitative experimental results demonstrate that our approach achieves superior performance with only 15% of the parameters and computational complexity of PixArt-delta. More examples are available at https://relactrl.github.io/RelaCtrl/.
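The skip-one-layer idea behind the ControlNet Relevance Score can be illustrated with a small sketch. The abstract does not give the scoring formula, so everything below is an assumption made for illustration: the `evaluate` callback, the two lower-is-better error metrics it returns, the equal-weight sum of relative degradations, and the made-up numbers in `toy_evaluate` are all hypothetical and are not taken from the paper.

```python
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical evaluator: takes the index of the control layer to skip
# (None means skip nothing) and returns (quality_error, control_error).
# Both values are assumed to be lower-is-better metrics.
Evaluator = Callable[[Optional[int]], Tuple[float, float]]


def relevance_scores(evaluate: Evaluator, num_layers: int) -> Dict[int, float]:
    """Score each control layer by how much skipping it degrades the metrics."""
    base_quality, base_control = evaluate(None)
    scores: Dict[int, float] = {}
    for layer in range(num_layers):
        quality, control = evaluate(layer)
        # Relative degradation of each metric against the full model.
        # The equal-weight sum is an illustrative choice, not the paper's formula.
        quality_drop = (quality - base_quality) / base_quality
        control_drop = (control - base_control) / base_control
        scores[layer] = quality_drop + control_drop
    return scores


def rank_layers(scores: Dict[int, float]) -> List[int]:
    """Return layer indices ordered from most to least relevant."""
    return sorted(scores, key=lambda layer: scores[layer], reverse=True)


def toy_evaluate(skip: Optional[int]) -> Tuple[float, float]:
    """Stand-in for a real inference-and-metrics run (made-up numbers)."""
    penalty = {0: 0.5, 1: 2.0, 2: 0.1}  # extra error when a layer is skipped
    extra = 0.0 if skip is None else penalty[skip]
    return 10.0 + extra, 5.0 + extra


if __name__ == "__main__":
    scores = relevance_scores(toy_evaluate, num_layers=3)
    print(rank_layers(scores))
```

With these toy numbers, layer 1 ranks as most relevant because skipping it causes the largest degradation. In a setup like the one the abstract describes, such a ranking would then guide where control layers are placed and how much capacity each one receives.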
