Representation Shift: Unifying Token Compression with FlashAttention
August 1, 2025
Authors: Joonmyung Choi, Sanghyeok Lee, Byungoh Ko, Eunseo Kim, Jihyung Kil, Hyunwoo J. Kim
cs.AI
Abstract
Transformers have demonstrated remarkable success across vision, language,
and video. Yet, increasing task complexity has led to larger models and more
tokens, raising the quadratic cost of self-attention and the overhead of GPU
memory access. To reduce the computation cost of self-attention, prior work has
proposed token compression techniques that drop redundant or less informative
tokens. Meanwhile, fused attention kernels such as FlashAttention have been
developed to alleviate memory overhead by avoiding attention map construction
and its associated I/O to HBM. This, however, makes such kernels incompatible with most
training-free token compression methods, which rely on attention maps to
determine token importance. Here, we propose Representation Shift, a
training-free, model-agnostic metric that measures the degree of change in each
token's representation. This seamlessly integrates token compression with
FlashAttention, without attention maps or retraining. Our method further
generalizes beyond Transformers to CNNs and state space models. Extensive
experiments show that Representation Shift enables effective token compression
compatible with FlashAttention, yielding significant speedups of up to 5.5% and
4.4% in video-text retrieval and video QA, respectively. Code is available at
https://github.com/mlvlab/Representation-Shift.
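
To make the idea concrete, the following is a minimal sketch (not the authors' implementation) of how a representation-shift score could drive training-free token compression. It assumes one plausible instantiation of the metric, the per-token L2 distance between a block's input and output representations, plus a keep-the-most-changed-tokens policy; the helper names `representation_shift_scores` and `compress_tokens`, the keep ratio, and the use of `torch.nn.TransformerEncoderLayer` as the block are all illustrative choices, not details taken from the paper. Because the score needs only the token representations themselves, no attention map is ever materialized, which is what keeps this style of compression compatible with fused kernels such as FlashAttention.

```python
import torch

def representation_shift_scores(block, x):
    """Score each token by how much its representation changes through a block.

    Hypothetical instantiation: the per-token L2 distance between the block's
    input and output. `block` is any callable mapping (B, N, D) -> (B, N, D),
    e.g. a Transformer block, a flattened CNN stage, or a state space block.
    """
    y = block(x)                              # updated token representations
    scores = (y - x).norm(dim=-1)             # (B, N): change per token
    return y, scores

def compress_tokens(x, scores, keep_ratio=0.5):
    """Keep the tokens whose representations shifted the most (assumed here to
    be the most informative) and drop the rest. No attention map is needed,
    so fused attention kernels can be used unchanged."""
    B, N, D = x.shape
    k = max(1, int(N * keep_ratio))
    idx = scores.topk(k, dim=1).indices.sort(dim=1).values  # keep token order
    return x.gather(1, idx.unsqueeze(-1).expand(-1, -1, D))

# Toy usage: batch of 2, 196 tokens, dim 64, with a standard encoder layer.
block = torch.nn.TransformerEncoderLayer(d_model=64, nhead=4, batch_first=True)
x = torch.randn(2, 196, 64)
y, scores = representation_shift_scores(block, x)
y_small = compress_tokens(y, scores, keep_ratio=0.5)
print(y_small.shape)  # torch.Size([2, 98, 64])
```

Note that the compression step is a pure indexing operation over token representations, so in principle the same scoring applies to CNN feature maps or state space model sequences after flattening the spatial or sequence dimension, in line with the abstract's claim that the metric is model-agnostic.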