Why Attention Patterns Exist: A Unifying Temporal Perspective Analysis

Abstract

Attention patterns play a crucial role in both the training and inference of large language models (LLMs). Prior work has identified individual patterns such as retrieval heads, sink heads, and diagonal traces, yet these observations remain fragmented and lack a unifying explanation. To bridge this gap, we introduce Temporal Attention Pattern Predictability Analysis (TAPPA), a unifying framework that explains diverse attention patterns by analyzing their underlying mathematical formulations from a temporally continuous perspective. TAPPA both deepens the understanding of attention behavior and guides inference acceleration approaches. Specifically, TAPPA divides attention patterns into predictable patterns with clear regularities and unpredictable patterns that appear effectively random. Our analysis reveals that this distinction can be explained by the degree of query self-similarity along the temporal dimension. For the predictable patterns, we provide a detailed mathematical analysis of three representative cases through the joint effect of queries, keys, and Rotary Positional Embeddings (RoPE). We validate TAPPA by applying its insights to KV cache compression and LLM pruning. Across these tasks, a simple metric motivated by TAPPA consistently improves performance over baseline methods. The code is available at https://github.com/MIRALab-USTC/LLM-TAPPA.
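
To make the notion of query self-similarity along the temporal dimension concrete, the following is a minimal illustrative sketch. It assumes that self-similarity can be approximated by the mean cosine similarity between a head's query vectors at positions t and t + lag. The function name `query_self_similarity`, the `lag` parameter, and the synthetic inputs are hypothetical and chosen for illustration only; the abstract does not define the metric, and this is not necessarily the one used in the paper or its repository.

```python
import numpy as np


def query_self_similarity(queries: np.ndarray, lag: int = 1) -> float:
    """Mean cosine similarity between q_t and q_{t+lag} for a single head.

    queries: array of shape (seq_len, head_dim).
    NOTE: illustrative proxy only, not the metric defined by the paper.
    """
    if lag < 1:
        raise ValueError("lag must be >= 1")
    if queries.ndim != 2 or queries.shape[0] <= lag:
        raise ValueError("expected shape (seq_len, head_dim) with seq_len > lag")
    earlier, later = queries[:-lag], queries[lag:]
    dots = np.sum(earlier * later, axis=-1)
    norms = np.linalg.norm(earlier, axis=-1) * np.linalg.norm(later, axis=-1)
    # Small epsilon guards against division by zero for all-zero queries.
    return float(np.mean(dots / (norms + 1e-12)))


# Synthetic queries (assumed data, not taken from a real model).
rng = np.random.default_rng(0)
base = rng.normal(size=64)
smooth = base + 0.05 * rng.normal(size=(128, 64))  # queries drift slowly over time
noisy = rng.normal(size=(128, 64))                 # queries independent at each step

smooth_score = query_self_similarity(smooth)
noisy_score = query_self_similarity(noisy)
print(f"smooth: {smooth_score:.3f}, noisy: {noisy_score:.3f}")
```

Under this proxy, a head whose queries change slowly over time scores close to 1, while a head whose queries are effectively independent across positions scores near 0, mirroring the predictable versus unpredictable distinction described above.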
