Rethinking Cross-Modal Interaction in Multimodal Diffusion Transformers
June 9, 2025
Authors: Zhengyao Lv, Tianlin Pan, Chenyang Si, Zhaoxi Chen, Wangmeng Zuo, Ziwei Liu, Kwan-Yee K. Wong
cs.AI
Abstract
Multimodal Diffusion Transformers (MM-DiTs) have achieved remarkable progress
in text-driven visual generation. However, even state-of-the-art MM-DiT models
like FLUX struggle with achieving precise alignment between text prompts and
generated content. We identify two key issues in the attention mechanism of
MM-DiT, namely 1) the suppression of cross-modal attention due to token
imbalance between visual and textual modalities and 2) the lack of
timestep-aware attention weighting, which hinder the alignment. To address
these issues, we propose Temperature-Adjusted Cross-modal Attention
(TACA), a parameter-efficient method that dynamically rebalances multimodal
interactions through temperature scaling and timestep-dependent adjustment.
When combined with LoRA fine-tuning, TACA significantly enhances text-image
alignment on the T2I-CompBench benchmark with minimal computational overhead.
We evaluate TACA on state-of-the-art models such as FLUX and SD3.5, demonstrating
its ability to improve image-text alignment in terms of object appearance,
attribute binding, and spatial relationships. Our findings highlight the
importance of balancing cross-modal attention in improving semantic fidelity in
text-to-image diffusion models. Our code is publicly available at
https://github.com/Vchitect/TACA
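
To make the core idea concrete, the sketch below illustrates temperature-adjusted cross-modal attention in its simplest form: the attention logits that query text keys are multiplied by a temperature factor, and the adjustment is gated on the diffusion timestep so it applies only during the noisy early steps. The function name, the `gamma` factor, and the piecewise-in-timestep rule are illustrative assumptions for exposition, not the paper's exact formulation.

```python
import numpy as np

def temperature_adjusted_attention(q, k, v, num_text_tokens,
                                   gamma=1.5, timestep=1.0, t_switch=0.5):
    """Single-head scaled dot-product attention with a temperature boost
    on text-key logits (an illustrative sketch, not the official TACA code).

    q, k, v: arrays of shape (seq, dim). The first `num_text_tokens`
    positions along the key axis are text tokens; the rest are image tokens.
    timestep: diffusion time in [0, 1], with 1.0 being the noisiest step.
    """
    d = q.shape[-1]
    logits = q @ k.T / np.sqrt(d)                     # (seq, seq)
    # Rebalance cross-modal attention: amplify text-token logits by gamma,
    # but only at early (noisy) timesteps -- a timestep-dependent adjustment.
    if timestep >= t_switch:
        logits[:, :num_text_tokens] *= gamma
    # Numerically stable softmax over the key axis.
    logits -= logits.max(axis=-1, keepdims=True)
    weights = np.exp(logits)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v
```

With `gamma=1.0` (or at late timesteps, `timestep < t_switch`) this reduces to plain scaled dot-product attention; raising `gamma` shifts attention mass toward the minority text tokens, counteracting the suppression caused by the token-count imbalance between modalities.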