On-the-fly Repulsion in the Contextual Space for Rich Diversity in Diffusion Transformers

Abstract

Modern text-to-image (T2I) diffusion models achieve remarkable semantic alignment, yet their outputs often lack variety, converging on a narrow set of visual solutions for a given prompt. This typicality bias is a problem for creative applications that need a wide range of generative outcomes. We identify a fundamental trade-off in current approaches to diversity: modifying the model inputs requires costly optimization to incorporate feedback from the generative path, whereas acting on spatially committed intermediate latents tends to disrupt the forming visual structure and leads to artifacts. In this work, we propose applying repulsion in the Contextual Space as a novel framework for achieving rich diversity in Diffusion Transformers. By intervening in the multimodal attention channels, we apply on-the-fly repulsion during the transformer's forward pass, injecting the intervention between blocks, where the text conditioning is enriched with emergent image structure. This allows the guidance trajectory to be redirected after it is structurally informed but before the composition is fixed. Our results demonstrate that repulsion in the Contextual Space produces significantly richer diversity without sacrificing visual fidelity or semantic adherence. Furthermore, our method is uniquely efficient, imposing only a small computational overhead, and it remains effective even in modern "Turbo" and distilled models, where traditional trajectory-based interventions typically fail.
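To make the idea of repulsion between samples concrete, the following is a minimal sketch and not the authors' implementation: the abstract does not specify the form of the repulsion. It assumes a batch of context (text) token embeddings for the same prompt, captured between two transformer blocks, and pushes the samples apart with an RBF-kernel force. The function name `repel_context`, the kernel, the median bandwidth heuristic, and the step size are all hypothetical choices made for illustration.

```python
import numpy as np

def repel_context(ctx, step=0.1, bandwidth=None):
    """Push each sample's context tokens away from those of the other samples.

    ctx: array of shape (B, T, D) holding B samples of one prompt,
         T context tokens, and D channels.
    Hypothetical helper: the RBF kernel and step size are illustrative
    assumptions, not taken from the paper.
    """
    B = ctx.shape[0]
    if B < 2:
        return ctx.copy()  # nothing to repel from

    flat = ctx.reshape(B, -1)                    # (B, T*D)
    diff = flat[:, None, :] - flat[None, :, :]   # diff[i, j] = x_i - x_j
    sq_dist = (diff ** 2).sum(axis=-1)           # (B, B) squared distances

    if bandwidth is None:
        # Median heuristic over off-diagonal squared distances (assumption).
        off_diag = sq_dist[~np.eye(B, dtype=bool)]
        bandwidth = np.median(off_diag) + 1e-8

    kernel = np.exp(-sq_dist / bandwidth)        # similarity between samples

    # Negative gradient of sum_j k(x_i, x_j) with respect to x_i:
    # similar samples push each other apart more strongly.
    force = (2.0 / bandwidth) * (kernel[:, :, None] * diff).sum(axis=1)

    return (flat + step * force).reshape(ctx.shape)


def total_sq_pairwise_distance(ctx):
    """Sum of squared distances over all ordered sample pairs."""
    flat = ctx.reshape(ctx.shape[0], -1)
    diff = flat[:, None, :] - flat[None, :, :]
    return float((diff ** 2).sum())


rng = np.random.default_rng(0)
context = rng.normal(size=(4, 8, 16))            # 4 samples, 8 tokens, 16 channels
repelled = repel_context(context)
```

In a real pipeline, an update of this kind would be applied inside the forward pass, between blocks, rather than on the input prompt embeddings or on the image latents; where exactly to hook it in depends on the model and is not covered by this sketch. Because the kernel is symmetric, the forces cancel across the batch, so the batch mean of the context is left unchanged while the samples spread out.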
