Qihoo-T2X: An Efficiency-Focused Diffusion Transformer via Proxy Tokens for Text-to-Any-Task
September 6, 2024
Authors: Jing Wang, Ao Ma, Jiasong Feng, Dawei Leng, Yuhui Yin, Xiaodan Liang
cs.AI
Abstract
The global self-attention mechanism in diffusion transformers involves
redundant computation due to the sparse and redundant nature of visual
information, and the attention map of tokens within a spatial window shows
significant similarity. To address this redundancy, we propose the Proxy Token
Diffusion Transformer (PT-DiT), which employs sparse representative token
attention (where the number of representative tokens is much smaller than the
total number of tokens) to model global visual information efficiently.
Specifically, in each transformer block, we randomly sample one token from each
spatial-temporal window to serve as a proxy token for that region. The global
semantics are captured through the self-attention of these proxy tokens and
then injected into all latent tokens via cross-attention. Simultaneously, we
introduce window and shifted-window attention to address the limitations in
detail modeling caused by the sparse attention mechanism. Building on the
well-designed PT-DiT, we further develop the Qihoo-T2X family, which includes a
variety of models for T2I, T2V, and T2MV tasks. Experimental results show that
PT-DiT achieves competitive performance while reducing the computational
complexity in both image and video generation tasks (e.g., a 48% reduction
compared to DiT and a 35% reduction compared to Pixart-alpha). Our source code
is available at https://github.com/360CVGroup/Qihoo-T2X.
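The proxy-token mechanism described above can be illustrated with a minimal NumPy sketch. This is an assumption-laden toy, not the authors' PT-DiT implementation: shapes, the single-head attention, and the function names (`proxy_token_attention`, `attention`) are illustrative, and the window/shifted-window branch and the transformer block scaffolding are omitted.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # Plain single-head scaled dot-product attention.
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores, axis=-1) @ v

def proxy_token_attention(tokens, window_size, rng):
    """Toy sketch of sparse proxy-token attention (hypothetical helper).

    tokens: (N, d) latent tokens, with N divisible by window_size.
    One token is randomly sampled from each (spatial-temporal) window
    to serve as that region's proxy; global semantics are captured by
    self-attention over the few proxies, then injected back into all
    latent tokens via cross-attention.
    """
    n, d = tokens.shape
    num_windows = n // window_size
    windows = tokens.reshape(num_windows, window_size, d)
    # Randomly sample one proxy token per window.
    idx = rng.integers(0, window_size, size=num_windows)
    proxies = windows[np.arange(num_windows), idx]        # (N/w, d)
    # Global self-attention among proxy tokens only: O((N/w)^2), not O(N^2).
    proxies = attention(proxies, proxies, proxies)
    # Cross-attention: every latent token queries the small proxy set.
    return attention(tokens, proxies, proxies)

rng = np.random.default_rng(0)
x = rng.standard_normal((16, 8))
out = proxy_token_attention(x, window_size=4, rng=rng)
print(out.shape)  # (16, 8)
```

The efficiency claim follows from the shapes: proxy self-attention costs O((N/w)^2) and the injection cross-attention O(N·N/w), both far below the O(N^2) of full global self-attention when the window size w is large.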