SALAD: Achieve High-Sparsity Attention via Efficient Linear Attention Tuning for Video Diffusion Transformer
January 23, 2026
Authors: Tongcheng Fang, Hanling Zhang, Ruiqi Xie, Zhuo Han, Xin Tao, Tianchen Zhao, Pengfei Wan, Wenbo Ding, Wanli Ouyang, Xuefei Ning, Yu Wang
cs.AI
Abstract
Diffusion Transformers have recently demonstrated remarkable performance in video generation. However, long input sequences result in high computational latency due to the quadratic complexity of full attention. Various sparse attention mechanisms have been proposed: training-free sparse attention is constrained by limited sparsity and thus offers only modest acceleration, whereas training-based methods can reach much higher sparsity but demand substantial data and computation for training. In this work, we propose SALAD, which introduces a lightweight linear attention branch in parallel with the sparse attention. By incorporating an input-dependent gating mechanism to finely balance the two branches, our method attains 90% sparsity and a 1.72x inference speedup while maintaining generation quality comparable to the full-attention baseline. Moreover, our finetuning process is highly efficient, requiring only 2,000 video samples and 1,600 training steps with a batch size of 8.
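The abstract describes a dual-branch design: a sparse attention branch combined with a lightweight linear attention branch, mixed by an input-dependent gate. The sketch below is a minimal illustration of that idea, not the paper's actual implementation: the top-k sparse mask, the elu-based kernel feature map for linear attention, and the per-token sigmoid gate parameterized by `w_gate` are all assumptions made for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def sparse_attention(q, k, v, keep_ratio=0.1):
    # Toy stand-in for a hardware-efficient sparse kernel: compute full scores,
    # then keep only the top keep_ratio fraction of keys per query.
    scores = q @ k.T / np.sqrt(q.shape[-1])
    n_keep = max(1, int(keep_ratio * k.shape[0]))
    thresh = np.sort(scores, axis=-1)[:, -n_keep][:, None]
    masked = np.where(scores >= thresh, scores, -np.inf)
    return softmax(masked) @ v

def linear_attention(q, k, v, eps=1e-6):
    # Kernelized attention with feature map phi(x) = elu(x) + 1 (an assumption);
    # cost is O(n * d^2) instead of O(n^2 * d).
    phi = lambda x: np.where(x > 0, x + 1.0, np.exp(x))
    qf, kf = phi(q), phi(k)
    kv = kf.T @ v                        # (d, d_v) summary of all keys/values
    z = qf @ kf.sum(axis=0)[:, None]     # per-query normalizer, shape (n, 1)
    return (qf @ kv) / (z + eps)

def gated_dual_branch_attention(q, k, v, w_gate, keep_ratio=0.1):
    # Hypothetical input-dependent gate: a per-token sigmoid that trades off
    # the sparse and linear branches.
    g = 1.0 / (1.0 + np.exp(-(q @ w_gate)))   # (n, 1) gate in (0, 1)
    return (1.0 - g) * sparse_attention(q, k, v, keep_ratio) \
         + g * linear_attention(q, k, v)
```

In this framing, the linear branch supplies a cheap global summary of the full sequence, so the sparse branch can drop to high sparsity (e.g. 90%) without losing long-range context; the gate decides per token how much of each branch to trust.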