PAROAttention: Pattern-Aware ReOrdering for Efficient Sparse and Quantized Attention in Visual Generation Models

Abstract

In visual generation, the quadratic complexity of attention mechanisms results in high memory and computational costs, especially for the long token sequences required in high-resolution image or multi-frame video generation. To address this, prior research has explored techniques such as sparsification and quantization. However, these techniques face significant challenges at low density and reduced bitwidths. Through systematic analysis, we identify that the core difficulty stems from the dispersed and irregular characteristics of visual attention patterns. Therefore, instead of introducing specialized sparsification and quantization designs to accommodate such patterns, we propose an alternative strategy: *reorganizing* the attention pattern to alleviate the challenges. Inspired by the local aggregation nature of visual feature extraction, we design a novel **Pattern-Aware token ReOrdering (PARO)** technique, which unifies the diverse attention patterns into a hardware-friendly block-wise pattern. This unification substantially simplifies and enhances both sparsification and quantization. We evaluate the performance-efficiency trade-offs of various design choices and finalize a methodology tailored to the unified pattern. Our approach, **PAROAttention**, achieves video and image generation with lossless metrics and results nearly identical to full-precision (FP) baselines, while operating at notably lower density (~20%-30%) and bitwidth (**INT8/INT4**), yielding a **1.9x** to **2.7x** end-to-end latency speedup.
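The reordering idea can be illustrated with a small, dependency-free sketch. The abstract does not spell out how PARO selects its token order, so the sketch below makes an assumption for illustration only: tokens lie on a (frames, height, width) grid and are reordered by flattening that grid in a different axis order. The helper names `axis_permutation` and `reordered_attention` are hypothetical and not taken from the paper. The sketch shows only the general property that makes any such reordering safe: softmax attention is permutation-equivariant, so permuting Q, K, and V with the same permutation and undoing it on the output leaves the result unchanged, while the attention map itself is rearranged.

```python
import math
import random
from itertools import product


def attention(q, k, v):
    """Plain softmax attention over lists of token vectors."""
    scale = 1.0 / math.sqrt(len(q[0]))
    out = []
    for qi in q:
        scores = [scale * sum(a * b for a, b in zip(qi, kj)) for kj in k]
        m = max(scores)  # subtract the max for numerical stability
        w = [math.exp(s - m) for s in scores]
        z = sum(w)
        out.append([sum(wj * vj[d] for wj, vj in zip(w, v)) / z
                    for d in range(len(v[0]))])
    return out


def axis_permutation(shape, order):
    """Row-major (F, H, W) token indices, listed in a different axis order.

    Hypothetical helper: order=(1, 2, 0) iterates height, then width,
    then frames, so the same pixel across frames becomes adjacent.
    """
    _, H, W = shape
    perm = []
    for coords in product(*(range(shape[a]) for a in order)):
        pos = [0, 0, 0]
        for axis, c in zip(order, coords):
            pos[axis] = c
        f, h, w = pos
        perm.append((f * H + h) * W + w)
    return perm


def reordered_attention(q, k, v, perm):
    """Attend in the permuted token order, then restore the original order."""
    qp, kp, vp = ([x[i] for i in perm] for x in (q, k, v))
    out_p = attention(qp, kp, vp)
    out = [None] * len(perm)
    for new_pos, old_pos in enumerate(perm):
        out[old_pos] = out_p[new_pos]  # inverse permutation on the output
    return out


random.seed(0)
shape, dim = (2, 3, 4), 8  # toy grid: 2 frames of 3x4 tokens
n = shape[0] * shape[1] * shape[2]
q, k, v = ([[random.gauss(0, 1) for _ in range(dim)] for _ in range(n)]
           for _ in range(3))

perm = axis_permutation(shape, (1, 2, 0))
ref = attention(q, k, v)
out = reordered_attention(q, k, v, perm)
max_err = max(abs(a - b) for r, o in zip(ref, out) for a, b in zip(r, o))
```

Here `max_err` stays at floating-point rounding level, which is why a reordering of this kind costs no accuracy by itself. Any gain would come from applying block-wise sparsification or quantization to the rearranged attention map, which this sketch does not attempt.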
