Representation Shift: Unifying Token Compression with FlashAttention
August 1, 2025
Authors: Joonmyung Choi, Sanghyeok Lee, Byungoh Ko, Eunseo Kim, Jihyung Kil, Hyunwoo J. Kim
cs.AI
Abstract
Transformers have demonstrated remarkable success across vision, language,
and video. Yet, increasing task complexity has led to larger models and more
tokens, raising the quadratic cost of self-attention and the overhead of GPU
memory access. To reduce the computation cost of self-attention, prior work has
proposed token compression techniques that drop redundant or less informative
tokens. Meanwhile, fused attention kernels such as FlashAttention have been
developed to alleviate memory overhead by avoiding attention map construction
and the associated I/O to high-bandwidth memory (HBM). This, however, makes
FlashAttention incompatible with most training-free token compression methods,
which rely on attention maps to determine token importance. Here, we propose
Representation Shift, a
training-free, model-agnostic metric that measures the degree of change in each
token's representation. This seamlessly integrates token compression with
FlashAttention, without attention maps or retraining. Our method further
generalizes beyond Transformers to CNNs and state space models. Extensive
experiments show that Representation Shift enables effective token compression
compatible with FlashAttention, yielding significant speedups of up to 5.5% and
4.4% in video-text retrieval and video QA, respectively. Code is available at
https://github.com/mlvlab/Representation-Shift.
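
For intuition, below is a minimal sketch of how a Representation Shift-style score could drive token pruning. It assumes the per-token shift is the L2 distance between a token's representation before and after a block, and that tokens with the smallest shift are treated as least informative and dropped; the helper names (shift_scores, prune_tokens) and the pruning policy are illustrative assumptions, not the authors' implementation. The property the abstract highlights is preserved: the score needs only a block's input and output activations, so it never touches the attention map that fused kernels like FlashAttention avoid materializing.

# Sketch of token pruning with a Representation Shift-style score.
# Assumptions (not from the abstract): shift = L2 distance between a token's
# representation before and after a block; tokens with the smallest shift are
# dropped. Names below are illustrative, not the authors' API.
import torch

def shift_scores(x_in: torch.Tensor, x_out: torch.Tensor) -> torch.Tensor:
    """Per-token change across a block.
    x_in, x_out: (batch, tokens, dim) activations before/after the block.
    Returns: (batch, tokens) L2 shift per token."""
    return (x_out - x_in).norm(dim=-1)

def prune_tokens(x_out: torch.Tensor, scores: torch.Tensor, keep: int) -> torch.Tensor:
    """Keep the `keep` tokens with the largest shift; drop the rest."""
    idx = scores.topk(keep, dim=-1).indices   # (batch, keep)
    idx = idx.sort(dim=-1).values             # restore original token order
    return x_out.gather(1, idx.unsqueeze(-1).expand(-1, -1, x_out.size(-1)))

# Toy usage with one Transformer encoder layer. Because the score uses only
# the block's input and output, it stays compatible with fused attention
# kernels that never build the full attention map.
if __name__ == "__main__":
    torch.manual_seed(0)
    layer = torch.nn.TransformerEncoderLayer(d_model=64, nhead=4, batch_first=True)
    layer.eval()
    x = torch.randn(2, 197, 64)               # e.g., a ViT-style token sequence
    with torch.no_grad():
        y = layer(x)
    y_pruned = prune_tokens(y, shift_scores(x, y), keep=128)
    print(y_pruned.shape)                     # torch.Size([2, 128, 64])

Because nothing here reads attention weights, the same scoring applies unchanged to CNN feature maps or state space model activations, which is the model-agnostic generalization the abstract claims.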