Cross-Attention Makes Inference Cumbersome in Text-to-Image Diffusion Models

Abstract

This study explores the role of cross-attention during inference in text-conditional diffusion models. We find that cross-attention outputs converge to a fixed point after a few inference steps. Accordingly, the point of convergence naturally divides the entire inference process into two stages: an initial semantics-planning stage, during which the model relies on cross-attention to plan text-oriented visual semantics, and a subsequent fidelity-improving stage, during which the model tries to generate images from the previously planned semantics. Surprisingly, ignoring text conditions in the fidelity-improving stage not only reduces computational complexity but also maintains model performance. This yields a simple, training-free method called TGATE for efficient generation, which caches the cross-attention output once it converges and keeps it fixed during the remaining inference steps. Our empirical study on the MS-COCO validation set confirms its effectiveness. The source code of TGATE is available at https://github.com/HaozheLiu-ST/T-GATE.
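
The caching idea described in the abstract can be illustrated with a minimal sketch. This is not the TGATE implementation from the repository: the class `GatedCrossAttention`, the stand-in function `toy_attend`, the `gate_step` parameter, and the toy update loop are all hypothetical names and assumptions chosen for illustration. The sketch only shows the control flow: cross-attention is evaluated during the early steps, and from an assumed convergence step onward the last computed output is reused.

```python
from typing import Callable, List, Optional

Vector = List[float]


class GatedCrossAttention:
    """Wraps a cross-attention callable and freezes its output from `gate_step` on."""

    def __init__(self, attend: Callable[[Vector, Vector], Vector], gate_step: int):
        self.attend = attend          # hypothetical cross-attention function
        self.gate_step = gate_step    # assumed step at which the output has converged
        self.cache: Optional[Vector] = None
        self.calls = 0                # number of real cross-attention evaluations

    def __call__(self, hidden: Vector, text: Vector, step: int) -> Vector:
        if step >= self.gate_step and self.cache is not None:
            # Fidelity-improving stage: reuse the cached output, skip the computation.
            return self.cache
        # Semantics-planning stage: compute cross-attention and remember the result.
        out = self.attend(hidden, text)
        self.calls += 1
        self.cache = out
        return out


def toy_attend(hidden: Vector, text: Vector) -> Vector:
    # Stand-in for real cross-attention: scales text features by the mean of hidden.
    scale = sum(hidden) / len(hidden)
    return [scale * t for t in text]


def run(num_steps: int = 10, gate_step: int = 4) -> GatedCrossAttention:
    layer = GatedCrossAttention(toy_attend, gate_step)
    hidden = [1.0, 2.0, 3.0]
    text = [0.5, -0.5]
    for step in range(num_steps):
        out = layer(hidden, text, step)
        # Toy update so that the hidden state changes between steps.
        hidden = [0.9 * h + 0.1 * out[0] for h in hidden]
    return layer
```

With `run(10, 4)`, the wrapped function is evaluated only on the first four steps; the remaining six steps return the cached output. How the convergence step is actually chosen, and how the cache interacts with classifier-free guidance, is not covered by this sketch.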
