On-the-Fly Repulsion in the Contextual Space for Rich Diversity in Diffusion Transformers

Abstract

Modern Text-to-Image (T2I) diffusion models have achieved remarkable semantic alignment, yet they often lack variety, converging on a narrow set of visual solutions for any given prompt. This typicality bias is a problem for creative applications that require a wide range of generative outcomes. We identify a fundamental trade-off in current approaches to diversity: modifying model inputs requires costly optimization to incorporate feedback from the generative path, whereas acting on spatially committed intermediate latents tends to disrupt the forming visual structure and leads to artifacts. In this work, we propose applying repulsion in the Contextual Space as a novel framework for achieving rich diversity in Diffusion Transformers. By intervening in the multimodal attention channels, we apply on-the-fly repulsion during the transformer's forward pass, injecting the intervention between blocks, where text conditioning is enriched with emergent image structure. This redirects the guidance trajectory after it is structurally informed but before the composition is fixed. Our results demonstrate that repulsion in the Contextual Space produces significantly richer diversity without sacrificing visual fidelity or semantic adherence. Furthermore, our method is efficient, imposing only a small computational overhead, and it remains effective even in modern "Turbo" and distilled models, where traditional trajectory-based interventions typically fail.
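
To make the idea of batch-wise repulsion on context tokens concrete, the following is a minimal NumPy sketch, not the formulation used in this work. It assumes that the text-stream hidden states of several samples generated from the same prompt are available between two transformer blocks as an array of shape (B, T, D), and it pushes them apart with a generic RBF-kernel-weighted repulsion term. The function name `repel_context` and the parameters `strength` and `bandwidth` are hypothetical, introduced only for illustration.

```python
import numpy as np

def repel_context(ctx, strength=0.1, bandwidth=None):
    """Push per-sample context tokens apart within a batch (illustrative only).

    ctx: array of shape (B, T, D); assumed to hold the text-stream hidden
         states of B samples of the same prompt, read between two blocks.
    Returns an array of the same shape.
    """
    B = ctx.shape[0]
    if B < 2:
        return ctx.copy()  # nothing to repel from

    flat = ctx.reshape(B, -1)                    # (B, T*D)
    diff = flat[:, None, :] - flat[None, :, :]   # diff[i, j] = x_i - x_j
    sq_dist = (diff ** 2).sum(axis=-1)           # (B, B) squared distances

    if bandwidth is None:
        # Median heuristic over off-diagonal distances (an assumption).
        off_diag = sq_dist[~np.eye(B, dtype=bool)]
        bandwidth = np.median(off_diag) + 1e-8

    kernel = np.exp(-sq_dist / bandwidth)        # nearby samples weigh more

    # Repulsive force on sample i: kernel-weighted sum of (x_i - x_j).
    # The diagonal contributes nothing because diff[i, i] is zero.
    force = (kernel[:, :, None] * diff).sum(axis=1) / (B - 1)

    return (flat + strength * force).reshape(ctx.shape)
```

In a real pipeline such a function would be called on the context stream inside the forward pass, for example through a hook between blocks, and only during the steps where the composition is still forming. Because the kernel is symmetric, the forces sum to zero across the batch, so the batch mean of the context is left unchanged and only the spread between samples grows.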