Rethinking Global Text Conditioning in Diffusion Transformers

February 9, 2026
Authors: Nikita Starodubcev, Daniil Pakhomov, Zongze Wu, Ilya Drobyshevskiy, Yuchen Liu, Zhonghao Wang, Yuqian Zhou, Zhe Lin, Dmitry Baranchuk
cs.AI

Abstract

Diffusion transformers typically incorporate textual information via attention layers and a modulation mechanism using a pooled text embedding. Nevertheless, recent approaches discard modulation-based text conditioning and rely exclusively on attention. In this paper, we address whether modulation-based text conditioning is necessary and whether it can provide any performance advantage. Our analysis shows that, in its conventional usage, the pooled embedding contributes little to overall performance, suggesting that attention alone is generally sufficient for faithfully propagating prompt information. However, we reveal that the pooled embedding can provide significant gains when used from a different perspective: serving as guidance and enabling controllable shifts toward more desirable properties. This approach is training-free, simple to implement, incurs negligible runtime overhead, and can be applied to various diffusion models, bringing improvements across diverse tasks, including text-to-image/video generation and image editing.
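The abstract does not spell out the guidance formula, but its description (a training-free, controllable shift applied through the pooled embedding at sampling time) suggests a CFG-style extrapolation on the modulation input. The sketch below illustrates that reading only; the function name, the choice of an empty-prompt baseline, and the `scale` parameter are illustrative assumptions, not the authors' exact method.

```python
import torch

def guided_pooled_embedding(
    pooled_prompt: torch.Tensor,    # pooled text embedding of the prompt, shape (B, D)
    pooled_baseline: torch.Tensor,  # pooled embedding of a baseline (e.g. empty) prompt, shape (B, D)
    scale: float = 2.0,             # guidance scale; scale = 1.0 recovers the unmodified embedding
) -> torch.Tensor:
    """Extrapolate the pooled embedding along the prompt direction (CFG-style).

    Hypothetical reading of the abstract: the shifted vector would replace the
    pooled embedding fed to the DiT's modulation (adaLN) layers during
    sampling, while the attention-based conditioning is left untouched.
    """
    return pooled_baseline + scale * (pooled_prompt - pooled_baseline)
```

Under this reading, only the vector entering the modulation layers changes, so the extra cost is a single multiply-add over a (B, D) tensor per sampling step, consistent with the claimed negligible overhead and the absence of retraining.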