PAROAttention: Pattern-Aware ReOrdering for Efficient Sparse and Quantized Attention in Visual Generation Models
June 19, 2025
Authors: Tianchen Zhao, Ke Hong, Xinhao Yang, Xuefeng Xiao, Huixia Li, Feng Ling, Ruiqi Xie, Siqi Chen, Hongyu Zhu, Yichong Zhang, Yu Wang
cs.AI
Abstract
In visual generation, the quadratic complexity of attention mechanisms
results in high memory and computational costs, especially for longer token
sequences required in high-resolution image or multi-frame video generation. To
address this, prior research has explored techniques such as sparsification and
quantization. However, these techniques face significant challenges under low
density and reduced bitwidths. Through systematic analysis, we identify that
the core difficulty stems from the dispersed and irregular characteristics of
visual attention patterns. Therefore, instead of introducing specialized
sparsification and quantization designs to accommodate such patterns, we propose
an alternative strategy: *reorganizing* the attention pattern to alleviate the
challenges. Inspired by the local aggregation nature of visual feature
extraction, we design a novel **Pattern-Aware token ReOrdering (PARO)**
technique, which unifies the diverse attention patterns into a
hardware-friendly block-wise pattern. This unification substantially simplifies
and enhances both sparsification and quantization. We evaluate the
performance-efficiency trade-offs of various design choices and finalize a
methodology tailored for the unified pattern. Our approach, **PAROAttention**,
achieves video and image generation with lossless metrics and results nearly
identical to those of full-precision (FP) baselines, while operating at notably
lower density (~20%-30%) and bitwidth (**INT8/INT4**), achieving a **1.9x** to
**2.7x** end-to-end latency speedup.
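
To make the reordering idea concrete, here is a minimal toy sketch of how changing the token order can turn a dispersed attention pattern into a block-wise one. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not specify how tokens are reordered, so the sketch assumes the reordering is a permutation of the axes of a (frames, height, width) token grid, and it uses a synthetic mask in which each token attends only to tokens at the same spatial position in other frames. The names `axis_permutation` and `occupied_blocks`, the grid size, and the block size are all hypothetical.

```python
from itertools import permutations, product


def axis_permutation(shape, order):
    """Return perm where perm[new_pos] is the original flat index of the token.

    `shape` is (F, H, W); the original layout is flattened in F, H, W order.
    `order` lists the axes from outermost to innermost in the new layout.
    """
    perm = []
    for coords in product(*(range(shape[axis]) for axis in order)):
        c = [0, 0, 0]
        for axis, value in zip(order, coords):
            c[axis] = value
        f, h, w = c
        perm.append((f * shape[1] + h) * shape[2] + w)
    return perm


def occupied_blocks(mask, perm, block):
    """Count block x block tiles of the reordered mask holding any nonzero entry."""
    n = len(perm)
    count = 0
    for bi in range(0, n, block):
        for bj in range(0, n, block):
            if any(
                mask[perm[i]][perm[j]]
                for i in range(bi, min(bi + block, n))
                for j in range(bj, min(bj + block, n))
            ):
                count += 1
    return count


# Toy grid (assumption): 4 frames of 2 x 2 tokens, i.e. 16 tokens in total.
F, H, W = 4, 2, 2
N = F * H * W
BLOCK = 4

# Synthetic "temporal" pattern (assumption): token i attends to token j only
# when both sit at the same (h, w) position, in any frame.
mask = [[int(i % (H * W) == j % (H * W)) for j in range(N)] for i in range(N)]

default_perm = axis_permutation((F, H, W), (0, 1, 2))    # frames outermost
reordered_perm = axis_permutation((F, H, W), (1, 2, 0))  # frames innermost

default_count = occupied_blocks(mask, default_perm, BLOCK)
reordered_count = occupied_blocks(mask, reordered_perm, BLOCK)

# Pick the axis order that touches the fewest blocks for this pattern.
best_order = min(
    permutations(range(3)),
    key=lambda o: occupied_blocks(mask, axis_permutation((F, H, W), o), BLOCK),
)

print("occupied blocks, default order:", default_count)      # 16 of 16
print("occupied blocks, reordered (H, W, F):", reordered_count)  # 4 of 16
print("best axis order:", best_order)
```

In this toy case the same set of attended pairs touches every 4 x 4 block in the default order but only the four diagonal blocks once the frame axis is innermost, so a block-sparse kernel could skip three quarters of the tiles without dropping any attended pair. Because the permutation only relabels tokens, applying it to queries, keys, and values and inverting it on the output leaves the attention result unchanged.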