PSA: Pyramid Sparse Attention for Efficient Video Understanding and Generation
December 3, 2025
Authors: Xiaolong Li, Youping Gu, Xi Lin, Weijie Wang, Bohan Zhuang
cs.AI
Abstract
Attention mechanisms are the core of foundation models, but their quadratic complexity remains a critical bottleneck for scaling. This challenge has driven the development of efficient attention mechanisms, with sparsity emerging as the dominant paradigm. Current methods typically retain or discard entire key-value blocks via binary masks, resulting in substantial information loss under high sparsity. To mitigate this problem, we present Pyramid Sparse Attention (PSA), a versatile module applicable to both video understanding and generation tasks. Instead of binary masking, PSA introduces multi-level pooled KV representations, enabling finer mask granularity. Specifically, each query block dynamically allocates lower pooling levels to critical KV blocks and higher levels to less important ones, creating an informative interpolation between full retention and complete pruning. This design, analogous to fixed-point quantization and to classical feature pyramid networks in computer vision, effectively mitigates information loss while preserving computational efficiency under a low compute budget. PSA is implemented as a native, hardware-friendly kernel whose decoupled block-tile design ensures efficient execution. Across video understanding and generation benchmarks, PSA preserves contextual information and visual fidelity, consistently outperforming or matching existing sparse attention baselines while offering a superior efficiency-quality trade-off. Our code and model weights are publicly available at: http://ziplab.co/PSA
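The abstract only sketches the mechanism, so the snippet below is a minimal, illustrative reconstruction of the core idea rather than the authors' released kernel. The function name psa_sketch, the block size, the pooling factors (1, 4, 16), and the mean-pooled importance score are all assumptions made for illustration; only the general scheme (each query block attends to every KV block at a dynamically chosen pooling level) comes from the paper's description.

import torch
import torch.nn.functional as F

def psa_sketch(q, k, v, block=64, levels=(1, 4, 16)):
    """q, k, v: [seq, dim] tensors; seq must be divisible by `block`."""
    seq, dim = q.shape
    nb = seq // block
    qb, kb, vb = (x.view(nb, block, dim) for x in (q, k, v))

    # Coarse importance of every KV block for every query block,
    # estimated here from mean-pooled query/key blocks (an assumption).
    score = (qb.mean(1) @ kb.mean(1).t()) / dim ** 0.5          # [nb, nb]
    # Rank KV blocks per query block: the top third keeps full resolution
    # (level 0), the middle third gets level-1 pooling, the rest level 2.
    rank = score.argsort(-1, descending=True).argsort(-1).float()
    level = torch.bucketize(rank, torch.tensor([nb / 3, 2 * nb / 3]))

    out = torch.empty_like(qb)
    for i in range(nb):
        ks, vs = [], []
        for j in range(nb):
            pool = levels[int(level[i, j])]
            # Average-pool this KV block down by the chosen factor
            # (pool == 1 keeps the block intact).
            ks.append(F.avg_pool1d(kb[j].t().unsqueeze(0), pool).squeeze(0).t())
            vs.append(F.avg_pool1d(vb[j].t().unsqueeze(0), pool).squeeze(0).t())
        k_i, v_i = torch.cat(ks), torch.cat(vs)                 # reduced KV set for block i
        attn = F.softmax((qb[i] @ k_i.t()) / dim ** 0.5, dim=-1)
        out[i] = attn @ v_i
    return out.view(seq, dim)

For intuition, calling psa_sketch(torch.randn(1024, 64), torch.randn(1024, 64), torch.randn(1024, 64)) lets each 64-token query block attend over a mixed-resolution KV set that is roughly half the original length instead of all 1024 keys. Note that this sketch materializes the pooled KV tensors explicitly and loops in Python; the paper's efficiency comes from a fused, hardware-friendly kernel with a decoupled block-tile design.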