Rethinking Cross-Modal Interaction in Multimodal Diffusion Transformers

Abstract

Multimodal Diffusion Transformers (MM-DiTs) have achieved remarkable progress in text-driven visual generation. However, even state-of-the-art MM-DiT models such as FLUX struggle to achieve precise alignment between text prompts and generated content. We identify two key issues in the attention mechanism of MM-DiT, both of which hinder alignment: 1) the suppression of cross-modal attention due to the token imbalance between the visual and textual modalities, and 2) the lack of timestep-aware attention weighting. To address these issues, we propose Temperature-Adjusted Cross-modal Attention (TACA), a parameter-efficient method that dynamically rebalances multimodal interactions through temperature scaling and timestep-dependent adjustment. When combined with LoRA fine-tuning, TACA significantly enhances text-image alignment on the T2I-CompBench benchmark with minimal computational overhead. We tested TACA on state-of-the-art models such as FLUX and SD3.5, demonstrating its ability to improve image-text alignment in terms of object appearance, attribute binding, and spatial relationships. Our findings highlight the importance of balancing cross-modal attention for improving semantic fidelity in text-to-image diffusion models. Our code is publicly available at https://github.com/Vchitect/TACA.
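
The idea of rebalancing cross-modal attention with a temperature can be illustrated with a small, dependency-free sketch. It is not the authors' implementation (see the linked repository for that). It assumes a joint sequence laid out as [text tokens; image tokens], and it assumes that the logits from image queries to text keys are multiplied by a factor gamma. The function names, the piecewise schedule in gamma_for_timestep, and the values t_threshold=0.5 and gamma_boost=1.2 are all hypothetical choices made for illustration, as is the convention that t lies in [0, 1] with t = 1 meaning pure noise.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention_weights(queries, keys, n_text, gamma=1.0):
    """Attention weights over a joint [text; image] token sequence.

    Assumption (illustrative): logits from image queries (index >= n_text)
    to text keys (index < n_text) are multiplied by gamma, which is the
    same as dividing their softmax temperature by gamma.
    """
    d = len(queries[0])
    weights = []
    for i, q in enumerate(queries):
        row = []
        for j, k in enumerate(keys):
            logit = sum(a * b for a, b in zip(q, k)) / math.sqrt(d)
            if i >= n_text and j < n_text:
                logit *= gamma  # rescale only the image-to-text block
            row.append(logit)
        weights.append(softmax(row))
    return weights


def gamma_for_timestep(t, t_threshold=0.5, gamma_boost=1.2):
    """Hypothetical schedule: boost only at noisy timesteps (t near 1)."""
    return gamma_boost if t >= t_threshold else 1.0


def text_mass(weights, n_text):
    """Total attention each image query places on the text tokens."""
    return [sum(row[:n_text]) for row in weights[n_text:]]


# Toy setup: 2 text tokens, 6 image tokens, identical embeddings, so that
# without rescaling every key receives the same attention weight.
n_text, n_image, dim = 2, 6, 4
tokens = [[1.0] * dim for _ in range(n_text + n_image)]

baseline = attention_weights(tokens, tokens, n_text, gamma_for_timestep(0.1))
boosted = attention_weights(tokens, tokens, n_text, gamma_for_timestep(0.9))
```

In this toy setup the text tokens are outnumbered three to one, so each image query gives them only 0.25 of its attention at baseline. With the hypothetical boost that share rises to about 0.33, while the rows belonging to text queries are unchanged. Note that multiplying a logit by gamma > 1 raises its weight only when the logit is positive, so behavior on real attention maps depends on the logit distribution.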
