Representation Shift: Unifying Token Compression with FlashAttention

Abstract

Transformers have demonstrated remarkable success across vision, language, and video. Yet increasing task complexity has led to larger models and more tokens, raising the quadratic cost of self-attention and the overhead of GPU memory access. To reduce the computation cost of self-attention, prior work has proposed token compression techniques that drop redundant or less informative tokens. Meanwhile, fused attention kernels such as FlashAttention have been developed to alleviate memory overhead by avoiding attention map construction and its associated I/O to HBM. Such kernels, however, are incompatible with most training-free token compression methods, which rely on attention maps to determine token importance. Here, we propose Representation Shift, a training-free, model-agnostic metric that measures the degree of change in each token's representation. Because it requires neither attention maps nor retraining, it allows token compression to be integrated seamlessly with FlashAttention. Our method further generalizes beyond Transformers to CNNs and state space models. Extensive experiments show that Representation Shift enables effective token compression compatible with FlashAttention, yielding significant speedups of up to 5.5% and 4.4% in video-text retrieval and video QA, respectively. Code is available at https://github.com/mlvlab/Representation-Shift.
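
To make the idea of scoring tokens by the change in their representation concrete, the following is a minimal sketch and not the authors' implementation. It assumes the shift is measured as the L2 distance between a token's features before and after a layer, and that tokens with a larger shift are the ones kept; the abstract specifies neither choice. The names representation_shift and keep_top_tokens are hypothetical helpers introduced only for illustration.

```python
import math


def representation_shift(before, after):
    """Per-token distance between features before and after a layer.

    ASSUMPTION: L2 distance is an illustrative choice. The abstract only
    says the metric measures the degree of change in each token's
    representation.
    """
    if len(before) != len(after):
        raise ValueError("before and after must hold the same number of tokens")
    return [math.dist(b, a) for b, a in zip(before, after)]


def keep_top_tokens(tokens, scores, keep):
    """Keep the `keep` highest-scoring tokens, preserving original order.

    ASSUMPTION: a larger shift is treated as a sign of higher importance.
    """
    ranked = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    kept = sorted(ranked[:keep])
    return [tokens[i] for i in kept], kept


# Toy example: 4 tokens with 2-dimensional features.
before = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 2.0]]
after = [[1.0, 0.1], [3.0, 5.0], [1.0, 1.0], [2.0, 4.0]]

scores = representation_shift(before, after)
pruned, kept_indices = keep_top_tokens(after, scores, keep=2)
print(kept_indices)  # [1, 3]
```

Since the score depends only on a layer's input and output features, no attention map is needed, which is why a metric of this kind can sit alongside a fused attention kernel that never materializes one.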