Gated Condition Injection without Multimodal Attention: Towards Controllable Linear-Attention Transformers

Abstract

Recent advances in diffusion-based controllable visual generation have markedly improved image quality. However, because of their large computational demands, these models are typically deployed on cloud servers, which raises serious concerns about user data privacy. To enable secure and efficient on-device generation, we explore controllable diffusion models built on linear-attention architectures, which offer superior scalability and efficiency even on edge devices. Our experiments reveal, however, that existing controllable generation frameworks such as ControlNet and OminiControl either lack the flexibility to support multiple heterogeneous condition types or converge slowly on such linear-attention models. To address these limitations, we propose a novel controllable diffusion framework tailored to linear-attention backbones such as SANA. The core of our method is a unified gated conditioning module operating in a dual-path pipeline, which effectively integrates multiple types of conditional input, including spatially aligned and non-aligned cues. Extensive experiments on multiple tasks and benchmarks demonstrate that our approach achieves state-of-the-art controllable generation among linear-attention models, surpassing existing methods in both fidelity and controllability.
