Rethinking Cross-Modal Interaction in Multimodal Diffusion Transformers
June 9, 2025
Authors: Zhengyao Lv, Tianlin Pan, Chenyang Si, Zhaoxi Chen, Wangmeng Zuo, Ziwei Liu, Kwan-Yee K. Wong
cs.AI
Abstract
Multimodal Diffusion Transformers (MM-DiTs) have achieved remarkable progress
in text-driven visual generation. However, even state-of-the-art MM-DiT models
like FLUX struggle to achieve precise alignment between text prompts and
generated content. We identify two key issues in the attention mechanism of
MM-DiT: 1) the suppression of cross-modal attention due to token
imbalance between visual and textual modalities, and 2) the lack of
timestep-aware attention weighting. Both hinder alignment. To address
these issues, we propose Temperature-Adjusted Cross-modal Attention
(TACA), a parameter-efficient method that dynamically rebalances multimodal
interactions through temperature scaling and timestep-dependent adjustment.
When combined with LoRA fine-tuning, TACA significantly enhances text-image
alignment on the T2I-CompBench benchmark with minimal computational overhead.
We tested TACA on state-of-the-art models like FLUX and SD3.5, demonstrating
its ability to improve image-text alignment in terms of object appearance,
attribute binding, and spatial relationships. Our findings highlight the
importance of balancing cross-modal attention in improving semantic fidelity in
text-to-image diffusion models. Our code is publicly available at
https://github.com/Vchitect/TACA
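
To make the two ideas in the abstract concrete, the following is a minimal sketch in plain Python, not the authors' implementation. It assumes that temperature scaling means multiplying the visual-to-text attention logits by a factor `gamma` before the softmax. It also assumes that the timestep dependence is a simple piecewise schedule that applies the boost only at noisy, early timesteps, with `t` near 1 meaning noisy. The function names, the threshold `t_threshold`, and the value `gamma_early=1.5` are all hypothetical and chosen only for illustration.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def gamma_for_timestep(t, t_threshold=0.5, gamma_early=1.5):
    """Hypothetical piecewise schedule (an assumption, not from the abstract):
    boost cross-modal logits only when t >= t_threshold, else leave them alone."""
    return gamma_early if t >= t_threshold else 1.0


def visual_query_attention(text_logits, visual_logits, gamma):
    """Attention weights of one visual query over [text keys, visual keys].

    Only the text (cross-modal) logits are multiplied by gamma; the
    visual-to-visual logits are left unchanged.
    """
    scaled = [gamma * s for s in text_logits] + list(visual_logits)
    return softmax(scaled)


def text_mass(text_logits, visual_logits, gamma):
    """Total attention weight that the visual query places on text tokens."""
    weights = visual_query_attention(text_logits, visual_logits, gamma)
    return sum(weights[: len(text_logits)])


if __name__ == "__main__":
    # Token imbalance: 4 text tokens compete with 64 visual tokens.
    text_logits = [2.0] * 4
    visual_logits = [2.0] * 64

    for t in (0.9, 0.1):
        gamma = gamma_for_timestep(t)
        mass = text_mass(text_logits, visual_logits, gamma)
        print(f"t={t:.1f} gamma={gamma:.1f} text attention mass={mass:.4f}")
```

With equal logits, the text tokens receive only 4/68 of the attention mass because visual tokens outnumber them; that is the imbalance the abstract describes. Multiplying the text logits by `gamma > 1` raises that share when the logits are positive, as they are here. For negative logits the same scaling would lower it, so this toy example illustrates the mechanism rather than a general guarantee.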