

Qihoo-T2X: An Efficiency-Focused Diffusion Transformer via Proxy Tokens for Text-to-Any-Task

September 6, 2024
Authors: Jing Wang, Ao Ma, Jiasong Feng, Dawei Leng, Yuhui Yin, Xiaodan Liang
cs.AI

Abstract

The global self-attention mechanism in diffusion transformers involves redundant computation due to the sparse and redundant nature of visual information, and the attention map of tokens within a spatial window shows significant similarity. To address this redundancy, we propose the Proxy Token Diffusion Transformer (PT-DiT), which employs sparse representative token attention (where the number of representative tokens is much smaller than the total number of tokens) to model global visual information efficiently. Specifically, in each transformer block, we randomly sample one token from each spatial-temporal window to serve as a proxy token for that region. The global semantics are captured through the self-attention of these proxy tokens and then injected into all latent tokens via cross-attention. Simultaneously, we introduce window and shift window attention to address the limitations in detail modeling caused by the sparse attention mechanism. Building on the well-designed PT-DiT, we further develop the Qihoo-T2X family, which includes a variety of models for T2I, T2V, and T2MV tasks. Experimental results show that PT-DiT achieves competitive performance while reducing the computational complexity in both image and video generation tasks (e.g., a 48% reduction compared to DiT and a 35% reduction compared to Pixart-alpha). Our source code is available at https://github.com/360CVGroup/Qihoo-T2X.
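The proxy-token mechanism described in the abstract can be illustrated with a minimal NumPy sketch: sample one token per spatial-temporal window as a proxy, run self-attention over the proxies to capture global semantics, then inject that context into all latent tokens via cross-attention. This is a simplified illustration, not the authors' implementation; the function names and shapes are assumptions, and real transformer blocks would add learned Q/K/V projections, multiple heads, and the window/shifted-window attention branch for detail modeling.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # Plain scaled dot-product attention (no learned projections, single head).
    scale = q.shape[-1] ** -0.5
    return softmax(q @ k.T * scale) @ v

def proxy_token_attention(tokens, window_size, rng):
    """Sketch of PT-DiT-style sparse global attention (hypothetical helper).

    tokens: (N, C) flattened latent tokens, N divisible by window_size,
            where each consecutive run of window_size tokens is one
            spatial-temporal window.
    """
    n, c = tokens.shape
    windows = tokens.reshape(n // window_size, window_size, c)

    # Randomly sample one token per window to act as that region's proxy.
    idx = rng.integers(0, window_size, size=windows.shape[0])
    proxies = windows[np.arange(windows.shape[0]), idx]      # (num_windows, C)

    # Global semantics: self-attention among the (few) proxy tokens.
    proxies = attention(proxies, proxies, proxies)

    # Inject global context into every latent token via cross-attention.
    return attention(tokens, proxies, proxies)               # (N, C)
```

Because the proxy set is much smaller than the full token set, the quadratic cost of global attention is paid only over `num_windows` proxies, while each of the `N` tokens attends to just those proxies; this is the source of the computational savings the abstract reports.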

