
Gated Condition Injection without Multimodal Attention: Towards Controllable Linear-Attention Transformers

March 29, 2026
作者: Yuhe Liu, Zhenxiong Tan, Yujia Hu, Songhua Liu, Xinchao Wang
cs.AI

Abstract

Recent advances in diffusion-based controllable visual generation have led to remarkable improvements in image quality. However, these powerful models are typically deployed on cloud servers due to their large computational demands, raising serious concerns about user data privacy. To enable secure and efficient on-device generation, in this paper we explore controllable diffusion models built upon linear attention architectures, which offer superior scalability and efficiency, even on edge devices. Yet, our experiments reveal that existing controllable generation frameworks, such as ControlNet and OminiControl, either lack the flexibility to support multiple heterogeneous condition types or suffer from slow convergence on such linear-attention models. To address these limitations, we propose a novel controllable diffusion framework tailored for linear attention backbones like SANA. The core of our method lies in a unified gated conditioning module working in a dual-path pipeline, which effectively integrates multi-type conditional inputs, such as spatially aligned and non-aligned cues. Extensive experiments on multiple tasks and benchmarks demonstrate that our approach achieves state-of-the-art controllable generation performance based on linear-attention models, surpassing existing methods in terms of fidelity and controllability.
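To make the two ingredients of the abstract concrete, below is a minimal NumPy sketch of (a) linear attention, where the softmax over an N×N score matrix is replaced by a kernel feature map so the cost is linear in sequence length, and (b) a gated injection step that mixes a condition branch into the main token branch through a learned sigmoid gate. This is an illustrative sketch only: the `elu+1` feature map, the `gated_condition_injection` function, and its gate parameterization are assumptions for exposition, not the paper's actual module design.

```python
import numpy as np

def linear_attention(q, k, v):
    # Linear attention: replace softmax with a positive feature map phi
    # (here elu(x)+1), so attention factors as phi(Q) @ (phi(K)^T V).
    # The (d, d) summary kv is independent of sequence length N,
    # giving O(N d^2) cost instead of the O(N^2 d) softmax version.
    phi = lambda x: np.where(x > 0, x + 1.0, np.exp(x))  # elu(x) + 1 > 0
    q, k = phi(q), phi(k)
    kv = k.T @ v                                  # (d, d) key-value summary
    z = q @ k.sum(axis=0, keepdims=True).T        # (N, 1) normalizer
    return (q @ kv) / (z + 1e-6)                  # (N, d) output

def gated_condition_injection(x, cond, gate_w):
    # Hypothetical gated injection: a per-token sigmoid gate in (0, 1)
    # decides how strongly the condition branch is mixed into the
    # main (image-token) branch, instead of cross-attending to it.
    g = 1.0 / (1.0 + np.exp(-(x @ gate_w)))       # (N, d) gate values
    return x + g * cond                           # gated residual mix

# Usage: 8 tokens of dimension 4, with a same-shaped condition stream.
rng = np.random.default_rng(0)
x = rng.standard_normal((8, 4))
cond = rng.standard_normal((8, 4))
out = linear_attention(x, x, x)                   # self-attention pass
out = gated_condition_injection(out, cond, rng.standard_normal((4, 4)))
```

Note that because `kv` compresses all keys and values into a d×d matrix, the same code runs unchanged on edge-scale and long sequences, which is the scalability property the abstract appeals to.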