PAROAttention: Pattern-Aware ReOrdering for Efficient Sparse and Quantized Attention in Visual Generation Models
June 19, 2025
Authors: Tianchen Zhao, Ke Hong, Xinhao Yang, Xuefeng Xiao, Huixia Li, Feng Ling, Ruiqi Xie, Siqi Chen, Hongyu Zhu, Yichong Zhang, Yu Wang
cs.AI
Abstract
In visual generation, the quadratic complexity of attention mechanisms
results in high memory and computational costs, especially for longer token
sequences required in high-resolution image or multi-frame video generation. To
address this, prior research has explored techniques such as sparsification and
quantization. However, these techniques face significant challenges under low
density and reduced bitwidths. Through systematic analysis, we identify that
the core difficulty stems from the dispersed and irregular characteristics of
visual attention patterns. Therefore, instead of introducing specialized
sparsification and quantization designs to accommodate such patterns, we propose
an alternative strategy: *reorganizing* the attention pattern to alleviate the
challenges. Inspired by the local aggregation nature of visual feature
extraction, we design a novel **Pattern-Aware token ReOrdering (PARO)**
technique, which unifies the diverse attention patterns into a
hardware-friendly block-wise pattern. This unification substantially simplifies
and enhances both sparsification and quantization. We evaluate the
performance-efficiency trade-offs of various design choices and finalize a
methodology tailored for the unified pattern. Our approach, **PAROAttention**,
achieves video and image generation with lossless metrics and nearly identical
results to full-precision (FP) baselines, while operating at notably lower
density (~20%-30%) and bitwidth (**INT8/INT4**), achieving a **1.9x** to
**2.7x** end-to-end latency speedup.
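To make the core idea concrete, below is a minimal PyTorch sketch of reordering tokens before attention so that locally aggregated attention mass lands in contiguous blocks, which a block-level mask can then exploit. The tile-major permutation, the per-block-row top-k selection rule, and all tensor shapes here are illustrative assumptions; the abstract does not disclose PARO's actual permutation or the finalized sparsification and quantization methodology.

```python
# Minimal sketch: pattern-aware token reordering + block-sparse attention.
# The tile-major permutation and per-block-row top-k rule are stand-ins,
# not the paper's actual design.
import torch

def tile_major_permutation(h: int, w: int, tile: int = 4) -> torch.Tensor:
    """Reorder raster-scan token indices into tile-major order so that
    spatially adjacent tokens become contiguous in the sequence."""
    idx = torch.arange(h * w).reshape(h, w)
    # (h//tile, w//tile, tile, tile): group each tile's tokens together
    tiles = idx.unfold(0, tile, tile).unfold(1, tile, tile)
    return tiles.reshape(-1)

def block_sparse_mask(scores: torch.Tensor, block: int, density: float) -> torch.Tensor:
    """Keep the highest-mass `density` fraction of (block x block) score
    tiles per query block row, yielding a hardware-friendly block mask."""
    n = scores.shape[-1]
    nb = n // block
    block_mass = scores.reshape(nb, block, nb, block).sum(dim=(1, 3))  # (nb, nb)
    top = max(1, int(density * nb))
    keep = torch.zeros(nb, nb, dtype=torch.bool)
    keep.scatter_(-1, block_mass.topk(top, dim=-1).indices, True)
    # Expand the block-level mask back to token resolution.
    return keep.repeat_interleave(block, 0).repeat_interleave(block, 1)

# Usage: permute tokens before attention, sparsify at block granularity,
# then undo the permutation so downstream layers see raster order.
h = w = 16
d = 64
perm = tile_major_permutation(h, w)
inv = torch.argsort(perm)
q = torch.randn(h * w, d)[perm]
k = torch.randn(h * w, d)[perm]
v = torch.randn(h * w, d)[perm]
scores = (q @ k.T) / d ** 0.5
mask = block_sparse_mask(scores, block=16, density=0.25)
attn = torch.softmax(scores.masked_fill(~mask, float("-inf")), dim=-1)
out = (attn @ v)[inv]  # restore the original token order
```

Block-aligned masks matter because GPU attention kernels process tiles of the score matrix at a time, so a mask that is dense inside a few blocks and empty elsewhere translates directly into skipped compute, whereas scattered, irregular sparsity does not.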