Why Attention Patterns Exist: A Unifying Temporal Perspective Analysis

Abstract

Attention patterns play a crucial role in both the training and inference of large language models (LLMs). Prior work has identified individual patterns such as retrieval heads, sink heads, and diagonal traces, yet these observations remain fragmented and lack a unifying explanation. To bridge this gap, we introduce Temporal Attention Pattern Predictability Analysis (TAPPA), a unifying framework that explains diverse attention patterns by analyzing their underlying mathematical formulations from a temporally continuous perspective. TAPPA both deepens the understanding of attention behavior and guides inference acceleration approaches. Specifically, TAPPA characterizes attention patterns as either predictable patterns with clear regularities or unpredictable patterns that appear effectively random. Our analysis reveals that this distinction can be explained by the degree of query self-similarity along the temporal dimension. Focusing on the predictable patterns, we provide a detailed mathematical analysis of three representative cases through the joint effect of queries, keys, and Rotary Positional Embeddings (RoPE). We validate TAPPA by applying its insights to KV cache compression and LLM pruning tasks. Across these tasks, a simple metric motivated by TAPPA consistently improves performance over baseline methods. The code is available at https://github.com/MIRALab-USTC/LLM-TAPPA.
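
To make the notion of query self-similarity along the temporal dimension more concrete, the following is a minimal illustrative sketch. It assumes self-similarity can be approximated by the mean cosine similarity between query vectors a fixed number of steps apart. The function name `query_self_similarity`, the `lag` parameter, and the choice of cosine similarity are assumptions made for illustration only; the metric actually used by TAPPA may differ and should be taken from the released code.

```python
import numpy as np


def query_self_similarity(queries: np.ndarray, lag: int = 1) -> float:
    """Mean cosine similarity between query vectors `lag` steps apart.

    queries: array of shape (seq_len, head_dim), one query per time step.
    This is a hypothetical proxy, not the metric defined in the paper.
    """
    if lag < 1:
        raise ValueError("lag must be at least 1")
    if queries.ndim != 2 or queries.shape[0] <= lag:
        raise ValueError("need a (seq_len, head_dim) array with seq_len > lag")
    earlier, later = queries[:-lag], queries[lag:]
    dots = np.sum(earlier * later, axis=1)
    norms = np.linalg.norm(earlier, axis=1) * np.linalg.norm(later, axis=1)
    # Guard against division by zero for all-zero query vectors.
    return float(np.mean(dots / np.maximum(norms, 1e-12)))


rng = np.random.default_rng(0)

# Slowly varying queries: one fixed direction plus small per-step noise.
base = rng.normal(size=64)
smooth_queries = base + 0.05 * rng.normal(size=(128, 64))

# Unrelated queries: an independent Gaussian vector at every step.
random_queries = rng.normal(size=(128, 64))

smooth_score = query_self_similarity(smooth_queries)
random_score = query_self_similarity(random_queries)
print(f"smooth: {smooth_score:.3f}, random: {random_score:.3f}")
```

Under this proxy, a head whose queries change slowly over time scores close to 1, while a head whose queries are unrelated from step to step scores close to 0. This mirrors, in toy form, the predictable versus unpredictable distinction described in the abstract.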
