Rethinking Cross-Modal Interaction in Multimodal Diffusion Transformers
June 9, 2025
Authors: Zhengyao Lv, Tianlin Pan, Chenyang Si, Zhaoxi Chen, Wangmeng Zuo, Ziwei Liu, Kwan-Yee K. Wong
cs.AI
Abstract
Multimodal Diffusion Transformers (MM-DiTs) have achieved remarkable progress
in text-driven visual generation. However, even state-of-the-art MM-DiT models
like FLUX struggle with achieving precise alignment between text prompts and
generated content. We identify two key issues in the attention mechanism of
MM-DiT, namely 1) the suppression of cross-modal attention due to token
imbalance between visual and textual modalities and 2) the lack of
timestep-aware attention weighting, which hinder the alignment. To address
these issues, we propose Temperature-Adjusted Cross-modal Attention
(TACA), a parameter-efficient method that dynamically rebalances multimodal
interactions through temperature scaling and timestep-dependent adjustment.
When combined with LoRA fine-tuning, TACA significantly enhances text-image
alignment on the T2I-CompBench benchmark with minimal computational overhead.
We evaluate TACA on state-of-the-art models such as FLUX and SD3.5, demonstrating
its ability to improve image-text alignment in terms of object appearance,
attribute binding, and spatial relationships. Our findings highlight the
importance of balancing cross-modal attention in improving semantic fidelity in
text-to-image diffusion models. Our code is publicly available at
https://github.com/Vchitect/TACA
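
To make the core idea concrete, the sketch below illustrates temperature-adjusted cross-modal attention in its simplest form: the attention logits that query text keys are multiplied by a temperature factor, and the adjustment is gated on the diffusion timestep so it applies only during the noisy early steps. The function name, the `gamma` factor, and the piecewise-in-timestep rule are illustrative assumptions for exposition, not the paper's exact formulation.

```python
import numpy as np

def temperature_adjusted_attention(q, k, v, num_text_tokens,
                                   gamma=1.5, timestep=1.0, t_switch=0.5):
    """Single-head scaled dot-product attention with a temperature boost
    on text-key logits (an illustrative sketch, not the official TACA code).

    q, k, v: arrays of shape (seq, dim). The first `num_text_tokens`
    positions along the key axis are text tokens; the rest are image tokens.
    timestep: diffusion time in [0, 1], with 1.0 being the noisiest step.
    """
    d = q.shape[-1]
    logits = q @ k.T / np.sqrt(d)                     # (seq, seq)
    # Rebalance cross-modal attention: amplify text-token logits by gamma,
    # but only at early (noisy) timesteps -- a timestep-dependent adjustment.
    if timestep >= t_switch:
        logits[:, :num_text_tokens] *= gamma
    # Numerically stable softmax over the key axis.
    logits -= logits.max(axis=-1, keepdims=True)
    weights = np.exp(logits)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v
```

With `gamma=1.0` (or at late timesteps, `timestep < t_switch`) this reduces to plain scaled dot-product attention; raising `gamma` shifts attention mass toward the minority text tokens, counteracting the suppression caused by the token-count imbalance between modalities.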