Demystifying the Slash Pattern in Attention: The Role of RoPE

Abstract

Large Language Models (LLMs) often exhibit slash attention patterns, in which attention scores concentrate along the Δ-th sub-diagonal for some offset Δ. These patterns play a key role in passing information across tokens. But why do they emerge? In this paper, we demystify the emergence of these Slash-Dominant Heads (SDHs) from both empirical and theoretical perspectives. First, by analyzing open-source LLMs, we find that SDHs are intrinsic to the models and generalize to out-of-distribution prompts. To explain this intrinsic emergence, we analyze the queries, keys, and Rotary Position Embedding (RoPE), which jointly determine attention scores. Our empirical analysis reveals two characteristic conditions of SDHs: (1) queries and keys are almost rank-one, and (2) RoPE is dominated by medium- and high-frequency components. Under these conditions, queries and keys are nearly identical across tokens, and interactions between the medium- and high-frequency components of RoPE give rise to SDHs. Beyond the empirical evidence, we show theoretically that these conditions are sufficient to ensure the emergence of SDHs by formalizing them as our modeling assumptions. In particular, we analyze the training dynamics of a shallow Transformer equipped with RoPE under these conditions, and prove that models trained via gradient descent exhibit SDHs. The SDHs generalize to out-of-distribution prompts.
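The mechanism described in the abstract can be illustrated with a minimal NumPy sketch. It is not the paper's construction. It idealizes condition (1) by assuming the query and key vectors are exactly identical across tokens, and it assumes the standard RoPE frequency schedule with base 10000. The function names `rope_rotate`, `slash_logits`, and `is_toeplitz` are hypothetical. Under these assumptions the logit between positions i and j depends only on the offset i − j, so every sub-diagonal is constant, which is the structure underlying a slash pattern. Condition (2), on which frequencies dominate, is not modeled here.

```python
import numpy as np

def rope_rotate(x, pos, thetas):
    """Apply RoPE to a vector x (even dimension) at integer position pos."""
    x1, x2 = x[0::2], x[1::2]          # paired coordinates
    ang = pos * thetas                 # one rotation angle per pair
    cos, sin = np.cos(ang), np.sin(ang)
    out = np.empty_like(x)
    out[0::2] = x1 * cos - x2 * sin
    out[1::2] = x1 * sin + x2 * cos
    return out

def slash_logits(q, k, seq_len, base=10000.0):
    """Pre-softmax logits when every token shares the same q and k (assumption)."""
    d = q.shape[0]
    thetas = base ** (-np.arange(0, d, 2) / d)   # assumed standard RoPE schedule
    Q = np.stack([rope_rotate(q, i, thetas) for i in range(seq_len)])
    K = np.stack([rope_rotate(k, j, thetas) for j in range(seq_len)])
    return Q @ K.T / np.sqrt(d)

def is_toeplitz(M):
    """True if every diagonal of the square matrix M is (numerically) constant."""
    n = M.shape[0]
    return all(
        np.allclose(np.diag(M, off), np.diag(M, off)[0])
        for off in range(-(n - 1), n)
    )

rng = np.random.default_rng(0)
d, n = 16, 12
q = rng.standard_normal(d)   # shared query direction (illustrative)
k = rng.standard_normal(d)   # shared key direction (illustrative)
logits = slash_logits(q, k, n)

# Logit as a function of the causal offset delta = i - j >= 0.
per_offset = logits[:, 0]
dominant_offset = int(np.argmax(per_offset))
print("logits constant along each sub-diagonal:", is_toeplitz(logits))
print("offset with the largest logit:", dominant_offset)
```

Which offset carries the largest logit depends on the particular q, k, and frequencies. The sketch only shows that identical queries and keys make the scores a function of relative position alone.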
