Why Attention Patterns Exist: A Unifying Temporal Perspective Analysis

Abstract

Attention patterns play a crucial role in both training and inference of large language models (LLMs). Prior work has identified individual patterns such as retrieval heads, sink heads, and diagonal traces, yet these observations remain fragmented and lack a unifying explanation. To bridge this gap, we introduce Temporal Attention Pattern Predictability Analysis (TAPPA), a unifying framework that explains diverse attention patterns by analyzing their underlying mathematical formulations from a temporally continuous perspective. TAPPA both deepens the understanding of attention behavior and guides inference acceleration approaches. Specifically, TAPPA characterizes attention patterns as either predictable patterns with clear regularities or unpredictable patterns that appear effectively random. Our analysis further reveals that this distinction can be explained by the degree of query self-similarity along the temporal dimension. Focusing on the predictable patterns, we provide a detailed mathematical analysis of three representative cases through the joint effect of queries, keys, and Rotary Positional Embeddings (RoPE). We validate TAPPA by applying its insights to KV cache compression and LLM pruning tasks. Across these tasks, a simple metric motivated by TAPPA consistently improves performance over baseline methods. The code is available at https://github.com/MIRALab-USTC/LLM-TAPPA.
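To make the notion of query self-similarity along the temporal dimension more concrete, the sketch below scores a sequence of query vectors by the mean cosine similarity between consecutive time steps. This is only an illustrative proxy: the abstract does not define the metric, so the function names `cosine` and `query_self_similarity`, the choice of cosine similarity, and the toy inputs are all assumptions rather than the method implemented in the TAPPA repository.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def query_self_similarity(queries):
    """Mean cosine similarity between consecutive query vectors.

    `queries` is a list of per-time-step query vectors for a single head.
    Hypothetical proxy for illustration; not the metric defined in the paper.
    """
    if len(queries) < 2:
        raise ValueError("need at least two query vectors")
    sims = [cosine(queries[t], queries[t + 1]) for t in range(len(queries) - 1)]
    return sum(sims) / len(sims)


# Toy data (assumed): queries that drift slowly over time ...
smooth = [[1.0, 0.0], [1.0, 0.1], [1.0, 0.2]]
# ... versus queries that flip direction at every step.
erratic = [[1.0, 0.0], [-1.0, 0.0], [1.0, 0.0]]

smooth_score = query_self_similarity(smooth)
erratic_score = query_self_similarity(erratic)
```

Under this proxy, a head whose queries change slowly scores close to 1, while a head whose queries change abruptly scores much lower, which loosely mirrors the predictable versus unpredictable distinction described above.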
