
Rethinking Global Text Conditioning in Diffusion Transformers

February 9, 2026
Authors: Nikita Starodubcev, Daniil Pakhomov, Zongze Wu, Ilya Drobyshevskiy, Yuchen Liu, Zhonghao Wang, Yuqian Zhou, Zhe Lin, Dmitry Baranchuk
cs.AI

Abstract

Diffusion transformers typically incorporate textual information via attention layers and a modulation mechanism using a pooled text embedding. Nevertheless, recent approaches discard modulation-based text conditioning and rely exclusively on attention. In this paper, we address whether modulation-based text conditioning is necessary and whether it can provide any performance advantage. Our analysis shows that, in its conventional usage, the pooled embedding contributes little to overall performance, suggesting that attention alone is generally sufficient for faithfully propagating prompt information. However, we reveal that the pooled embedding can provide significant gains when used from a different perspective: serving as guidance and enabling controllable shifts toward more desirable properties. This approach is training-free, simple to implement, incurs negligible runtime overhead, and can be applied to various diffusion models, bringing improvements across diverse tasks, including text-to-image/video generation and image editing.
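To make the conditioning path in question concrete, the sketch below shows a DiT-style adaLN modulation block in PyTorch, where a pooled text embedding (typically summed with the timestep embedding) produces per-block scale, shift, and gate parameters. The `guided_pooled_embedding` helper is a hypothetical illustration of a guidance-style shift applied along this modulation path; the abstract does not specify the paper's exact guidance rule, so treat it as an assumption, not the authors' method.

```python
import torch
import torch.nn as nn


class AdaLNModulation(nn.Module):
    """Illustrative DiT-style adaLN block: a conditioning vector (timestep
    embedding plus pooled text embedding) is mapped to shift/scale/gate."""

    def __init__(self, hidden_dim: int, cond_dim: int):
        super().__init__()
        self.norm = nn.LayerNorm(hidden_dim, elementwise_affine=False)
        # Projects the conditioning vector to shift, scale, and gate values.
        self.to_mod = nn.Sequential(nn.SiLU(), nn.Linear(cond_dim, 3 * hidden_dim))

    def forward(self, x: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        # x: [batch, tokens, hidden_dim], cond: [batch, cond_dim]
        shift, scale, gate = self.to_mod(cond).chunk(3, dim=-1)
        x_mod = self.norm(x) * (1 + scale.unsqueeze(1)) + shift.unsqueeze(1)
        return gate.unsqueeze(1) * x_mod


def guided_pooled_embedding(pooled_cond: torch.Tensor,
                            pooled_uncond: torch.Tensor,
                            scale: float = 2.0) -> torch.Tensor:
    # Hypothetical guidance rule: extrapolate the pooled embedding away from a
    # "null" embedding before it enters the modulation path, giving a
    # controllable shift without retraining or extra forward passes.
    return pooled_uncond + scale * (pooled_cond - pooled_uncond)
```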