Representation Shift: Unifying Token Compression with FlashAttention
August 1, 2025
Authors: Joonmyung Choi, Sanghyeok Lee, Byungoh Ko, Eunseo Kim, Jihyung Kil, Hyunwoo J. Kim
cs.AI
Abstract
Transformers have demonstrated remarkable success across vision, language,
and video. Yet, increasing task complexity has led to larger models and more
tokens, raising the quadratic cost of self-attention and the overhead of GPU
memory access. To reduce the computation cost of self-attention, prior work has
proposed token compression techniques that drop redundant or less informative
tokens. Meanwhile, fused attention kernels such as FlashAttention have been
developed to alleviate memory overhead by avoiding attention map construction
and the associated I/O to high-bandwidth memory (HBM). This, however, makes
FlashAttention incompatible with most training-free token compression methods,
which rely on attention maps to determine token importance. Here, we propose
Representation Shift, a
training-free, model-agnostic metric that measures the degree of change in each
token's representation. This seamlessly integrates token compression with
FlashAttention, without attention maps or retraining. Our method further
generalizes beyond Transformers to CNNs and state space models. Extensive
experiments show that Representation Shift enables effective token compression
compatible with FlashAttention, yielding significant speedups of up to 5.5% and
4.4% in video-text retrieval and video QA, respectively. Code is available at
https://github.com/mlvlab/Representation-Shift.
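
For intuition, below is a minimal sketch of how a Representation Shift-style score could drive token pruning. It assumes the per-token shift is the L2 distance between a token's representation before and after a block, and that tokens with the smallest shift are treated as least informative and dropped; the helper names (shift_scores, prune_tokens) and the pruning policy are illustrative assumptions, not the authors' implementation. The property the abstract highlights is preserved: the score needs only a block's input and output activations, so it never touches the attention map that fused kernels like FlashAttention avoid materializing.

# Sketch of token pruning with a Representation Shift-style score.
# Assumptions (not from the abstract): shift = L2 distance between a token's
# representation before and after a block; tokens with the smallest shift are
# dropped. Names below are illustrative, not the authors' API.
import torch

def shift_scores(x_in: torch.Tensor, x_out: torch.Tensor) -> torch.Tensor:
    """Per-token change across a block.
    x_in, x_out: (batch, tokens, dim) activations before/after the block.
    Returns: (batch, tokens) L2 shift per token."""
    return (x_out - x_in).norm(dim=-1)

def prune_tokens(x_out: torch.Tensor, scores: torch.Tensor, keep: int) -> torch.Tensor:
    """Keep the `keep` tokens with the largest shift; drop the rest."""
    idx = scores.topk(keep, dim=-1).indices   # (batch, keep)
    idx = idx.sort(dim=-1).values             # restore original token order
    return x_out.gather(1, idx.unsqueeze(-1).expand(-1, -1, x_out.size(-1)))

# Toy usage with one Transformer encoder layer. Because the score uses only
# the block's input and output, it stays compatible with fused attention
# kernels that never build the full attention map.
if __name__ == "__main__":
    torch.manual_seed(0)
    layer = torch.nn.TransformerEncoderLayer(d_model=64, nhead=4, batch_first=True)
    layer.eval()
    x = torch.randn(2, 197, 64)               # e.g., a ViT-style token sequence
    with torch.no_grad():
        y = layer(x)
    y_pruned = prune_tokens(y, shift_scores(x, y), keep=128)
    print(y_pruned.shape)                     # torch.Size([2, 128, 64])

Because nothing here reads attention weights, the same scoring applies unchanged to CNN feature maps or state space model activations, which is the model-agnostic generalization the abstract claims.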