

SALAD: Achieve High-Sparsity Attention via Efficient Linear Attention Tuning for Video Diffusion Transformer

January 23, 2026
Authors: Tongcheng Fang, Hanling Zhang, Ruiqi Xie, Zhuo Han, Xin Tao, Tianchen Zhao, Pengfei Wan, Wenbo Ding, Wanli Ouyang, Xuefei Ning, Yu Wang
cs.AI

Abstract

Diffusion Transformers have recently demonstrated remarkable performance in video generation. However, long input sequences incur high computational latency due to the quadratic complexity of full attention. Various sparse attention mechanisms have been proposed to address this: training-free sparse attention is constrained by limited sparsity and thus offers only modest acceleration, whereas training-based methods can reach much higher sparsity but demand substantial data and computation for training. In this work, we propose SALAD, which introduces a lightweight linear attention branch in parallel with the sparse attention. By incorporating an input-dependent gating mechanism to finely balance the two branches, our method attains 90% sparsity and a 1.72x inference speedup while maintaining generation quality comparable to the full-attention baseline. Moreover, our finetuning process is highly efficient, requiring only 2,000 video samples and 1,600 training steps with a batch size of 8.
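
A minimal sketch of the dual-branch design described in the abstract, assuming a PyTorch-style layer. The sparse-attention branch is left as a placeholder callable, and the gate parametrization and linear-attention feature map are illustrative assumptions, not the authors' released implementation.

```python
# Sketch only: sparse branch, gate parametrization, and feature map are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


def linear_attention(q, k, v, eps=1e-6):
    """O(N) attention: softmax replaced by a positive feature map (elu + 1)."""
    q, k = F.elu(q) + 1, F.elu(k) + 1                    # (B, H, N, D)
    kv = torch.einsum("bhnd,bhne->bhde", k, v)           # sum_n phi(k_n) v_n^T
    z = torch.einsum("bhnd,bhd->bhn", q, k.sum(dim=2))   # per-query normalizer
    return torch.einsum("bhnd,bhde->bhne", q, kv) / (z.unsqueeze(-1) + eps)


class GatedSparseLinearAttention(nn.Module):
    """Sparse attention plus a parallel lightweight linear-attention branch,
    mixed by an input-dependent gate (hypothetical parametrization)."""

    def __init__(self, dim, num_heads, sparse_attn):
        super().__init__()
        self.num_heads = num_heads
        self.sparse_attn = sparse_attn          # e.g. a block-sparse kernel at ~90% sparsity
        self.gate = nn.Linear(dim, num_heads)   # per-head gate computed from token features

    def forward(self, x, q, k, v):
        # x: (B, N, dim) token features for the gate; q, k, v: (B, H, N, D)
        sparse_out = self.sparse_attn(q, k, v)            # high-sparsity branch
        linear_out = linear_attention(q, k, v)            # global, linear-complexity branch
        g = torch.sigmoid(self.gate(x))                   # (B, N, H), input-dependent
        g = g.permute(0, 2, 1).unsqueeze(-1)              # (B, H, N, 1)
        return g * sparse_out + (1 - g) * linear_out
```

In this sketch the gate is a per-token, per-head sigmoid, so the linear branch can compensate wherever the sparse branch drops too much global context; only the small gate and linear-branch parameters would need finetuning, consistent with the lightweight tuning budget reported in the abstract.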