Qihoo-T2X: An Efficiency-Focused Diffusion Transformer via Proxy Tokens for Text-to-Any-Task

Abstract

The global self-attention mechanism in diffusion transformers involves redundant computation because visual information is sparse and redundant, and the attention maps of tokens within a spatial window are highly similar. To address this redundancy, we propose the Proxy Token Diffusion Transformer (PT-DiT), which employs sparse representative token attention (where the number of representative tokens is much smaller than the total number of tokens) to model global visual information efficiently. Specifically, in each transformer block, we randomly sample one token from each spatial-temporal window to serve as a proxy token for that region. The global semantics are captured through self-attention among these proxy tokens and then injected into all latent tokens via cross-attention. In addition, we introduce window attention and shifted-window attention to compensate for the limited detail modeling of the sparse attention mechanism. Building on PT-DiT, we further develop the Qihoo-T2X family, which includes a variety of models for T2I, T2V, and T2MV tasks. Experimental results show that PT-DiT achieves competitive performance while reducing computational complexity in both image and video generation tasks (e.g., a 48% reduction compared to DiT and a 35% reduction compared to Pixart-alpha). Our source code is available at https://github.com/360CVGroup/Qihoo-T2X.
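The proxy-token mechanism described above can be illustrated with a minimal, standard-library Python sketch. It is not the released implementation: it assumes a single 2D frame rather than spatial-temporal windows, uses no learned query/key/value projections, adds the injected global information back through a plain residual sum, and omits the window and shifted-window attention branches. All function names (`attend`, `sample_proxy_indices`, `proxy_token_attention`) are hypothetical.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attend(queries, keys, values):
    """Scaled dot-product attention on lists of vectors (assumption: no learned projections)."""
    scale = 1.0 / math.sqrt(len(keys[0]))
    out = []
    for q in queries:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[d] for w, v in zip(weights, values))
                    for d in range(len(values[0]))])
    return out


def sample_proxy_indices(height, width, window, rng):
    """Pick one random token index from each non-overlapping window x window region.

    Assumes height and width are divisible by window.
    """
    indices = []
    for top in range(0, height, window):
        for left in range(0, width, window):
            row = top + rng.randrange(window)
            col = left + rng.randrange(window)
            indices.append(row * width + col)
    return indices


def proxy_token_attention(tokens, height, width, window, rng):
    """Self-attention among proxies, then cross-attention from every token to the proxies."""
    proxy_idx = sample_proxy_indices(height, width, window, rng)
    proxies = [tokens[i] for i in proxy_idx]
    proxies = attend(proxies, proxies, proxies)   # global semantics among proxy tokens
    injected = attend(tokens, proxies, proxies)   # inject into all latent tokens
    # Assumption: a simple residual sum combines each token with the injected information.
    updated = [[t + g for t, g in zip(tok, glob)] for tok, glob in zip(tokens, injected)]
    return updated, proxy_idx


rng = random.Random(0)
height, width, window, dim = 8, 8, 4, 4
tokens = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(height * width)]
updated, proxy_idx = proxy_token_attention(tokens, height, width, window, rng)

n = height * width            # total number of latent tokens
m = len(proxy_idx)            # number of proxy tokens (one per window)
full_pairs = n * n            # query-key pairs in global self-attention: 4096
proxy_pairs = m * m + n * m   # proxy self-attention plus cross-attention: 272
print(m, full_pairs, proxy_pairs)
```

In this toy setting, 64 tokens with 4 proxies need 272 query-key pairs instead of 4096. This counts only attention pairs in the simplified sketch and is not meant to reproduce the whole-model reductions reported in the abstract.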
