

LiteAttention: A Temporal Sparse Attention for Diffusion Transformers

November 14, 2025
Authors: Dor Shmilovich, Tony Wu, Aviad Dahan, Yuval Domb
cs.AI

Abstract

Diffusion Transformers, particularly for video generation, achieve remarkable quality but suffer from quadratic attention complexity, leading to prohibitive latency. Existing acceleration methods face a fundamental trade-off: dynamically estimating sparse attention patterns at each denoising step incurs high computational overhead and estimation errors, while static sparsity patterns remain fixed throughout denoising and are often suboptimal. We identify a key structural property of diffusion attention, namely, that its sparsity patterns exhibit strong temporal coherence across denoising steps: tiles deemed non-essential at step t typically remain so at step t+δ. Leveraging this observation, we introduce LiteAttention, a method that exploits temporal coherence to enable evolutionary computation skips across the denoising sequence. By marking non-essential tiles early and propagating skip decisions forward, LiteAttention eliminates redundant attention computations without repeated profiling overheads, combining the adaptivity of dynamic methods with the efficiency of static ones. We implement a highly optimized LiteAttention kernel on top of FlashAttention and demonstrate substantial speedups on production video diffusion models with no degradation in quality. The code and implementation details will be publicly released.
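
To make the forward-propagated skip idea concrete, below is a minimal NumPy sketch of tile-level attention with a skip mask that persists across denoising steps. Everything specific here is an assumption for illustration, not a detail from the paper: the tile size, the threshold value, the max-score importance proxy, the always-live diagonal tile, and all function names are hypothetical, and the actual LiteAttention kernel performs the skipping inside FlashAttention in CUDA rather than in a Python loop.

```python
import numpy as np

TILE = 64              # hypothetical tile edge length (queries x keys per tile)
SKIP_THRESHOLD = 1e-3  # hypothetical importance cutoff; the paper does not publish one


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def denoise_with_skip_propagation(qs, ks, vs):
    """qs, ks, vs: per-step tensors of shape (steps, seq, dim).

    The skip mask only ever grows across denoising steps: a tile marked
    non-essential at step t is never recomputed at step t+delta, which is
    the temporal-coherence property the abstract describes.
    """
    steps, seq, dim = qs.shape
    n = seq // TILE
    skip = np.zeros((n, n), dtype=bool)  # evolves monotonically across steps
    outs = []
    for t in range(steps):
        # -inf marks key positions a query row never attends to; the real
        # kernel simply never launches compute for those tiles.
        scores = np.full((seq, seq), -np.inf)
        for i in range(n):
            qi = qs[t, i * TILE:(i + 1) * TILE]
            for j in range(n):
                if skip[i, j]:
                    continue  # decision propagated from an earlier step
                kj = ks[t, j * TILE:(j + 1) * TILE]
                blk = qi @ kj.T / np.sqrt(dim)
                # Hypothetical importance proxy: uniformly tiny scores mean the
                # tile contributes almost nothing after the softmax. Keep the
                # diagonal tile live so every query row retains some mass.
                if i != j and np.abs(blk).max() < SKIP_THRESHOLD:
                    skip[i, j] = True  # mark once, skip at all later steps
                    continue
                scores[i * TILE:(i + 1) * TILE, j * TILE:(j + 1) * TILE] = blk
        probs = softmax(scores, axis=-1)
        outs.append(probs @ vs[t])
    return np.stack(outs)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    steps, seq, dim = 8, 256, 64
    qs, ks, vs = (rng.standard_normal((steps, seq, dim)) for _ in range(3))
    print(denoise_with_skip_propagation(qs, ks, vs).shape)  # (8, 256, 64)
```

Because the mask only grows, each tile's importance is profiled at most once over the whole denoising sequence; under the sketch's assumptions, this is where the combination of dynamic adaptivity and static-method efficiency comes from.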