LiteAttention: A Temporal Sparse Attention for Diffusion Transformers
November 14, 2025
Authors: Dor Shmilovich, Tony Wu, Aviad Dahan, Yuval Domb
cs.AI
Abstract
Diffusion Transformers, particularly for video generation, achieve remarkable quality but suffer from quadratic attention complexity, leading to prohibitive latency. Existing acceleration methods face a fundamental trade-off: dynamically estimating sparse attention patterns at each denoising step incurs high computational overhead and estimation errors, while static sparsity patterns remain fixed and often suboptimal throughout denoising. We identify a key structural property of diffusion attention, namely, its sparsity patterns exhibit strong temporal coherence across denoising steps. Tiles deemed non-essential at step t typically remain so at step t+δ. Leveraging this observation, we introduce LiteAttention, a method that exploits temporal coherence to enable evolutionary computation skips across the denoising sequence. By marking non-essential tiles early and propagating skip decisions forward, LiteAttention eliminates redundant attention computations without repeated profiling overheads, combining the adaptivity of dynamic methods with the efficiency of static ones. We implement a highly optimized LiteAttention kernel on top of FlashAttention and demonstrate substantial speedups on production video diffusion models, with no degradation in quality. The code and implementation details will be publicly released.
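The core mechanism described above — marking non-essential attention tiles and propagating skip decisions forward across denoising steps — can be illustrated with a toy NumPy sketch. This is not the paper's kernel (the actual implementation is built on FlashAttention); the tile size, the attention-mass threshold, and the function names here are illustrative assumptions. A tile whose total softmax mass falls below a threshold at step t is marked skipped, and the mask is carried into step t+δ so the scores for that tile are never recomputed:

```python
import numpy as np

def tile_attention_with_skip(Q, K, V, skip_mask, tile, thresh):
    """Tiled attention that skips key tiles marked in skip_mask.

    Q, K, V:   (n, d) arrays; n must be a multiple of `tile`.
    skip_mask: (n//tile, n//tile) bool array; True = skip this (q, k) tile.
    thresh:    illustrative cutoff on mean attention mass per query row;
               tiles falling below it are marked for future steps.
    Returns the attention output and the updated skip mask.
    Assumes at least one key tile per query tile remains unskipped.
    """
    n, d = Q.shape
    nt = n // tile
    scale = 1.0 / np.sqrt(d)
    out = np.zeros_like(V)
    new_skip = skip_mask.copy()
    for qi in range(nt):
        q = Q[qi * tile:(qi + 1) * tile]
        # Scores for skipped tiles stay at -inf, so they get zero softmax mass
        # and their QK^T product is never computed.
        scores = np.full((tile, n), -np.inf)
        for ki in range(nt):
            if skip_mask[qi, ki]:
                continue
            k = K[ki * tile:(ki + 1) * tile]
            scores[:, ki * tile:(ki + 1) * tile] = (q @ k.T) * scale
        probs = np.exp(scores - scores.max(axis=1, keepdims=True))
        probs /= probs.sum(axis=1, keepdims=True)
        out[qi * tile:(qi + 1) * tile] = probs @ V
        # Evolutionary skip: a tile whose attention mass is negligible now is
        # marked non-essential, and the decision propagates to later steps.
        for ki in range(nt):
            if skip_mask[qi, ki]:
                continue
            mass = probs[:, ki * tile:(ki + 1) * tile].sum()
            if mass < thresh * tile:
                new_skip[qi, ki] = True
    return out, new_skip
```

A denoising loop would thread the mask through the steps (`mask` starts all-False and only ever grows), which is what gives the method dynamic adaptivity without per-step profiling: the decision is made once, when a tile first becomes negligible, rather than re-estimated at every step.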