Attention Sinks Are Provably Necessary in Softmax Transformers: Evidence from Trigger-Conditional Tasks

Abstract

Transformers often display an attention sink: probability mass concentrates on a fixed, content-agnostic position. We prove that computing a simple trigger-conditional behavior necessarily induces a sink in softmax self-attention models. Our results formalize a familiar intuition: normalization over a probability simplex must force attention to collapse onto a stable anchor to realize a default state (e.g., when the model needs to ignore the input). We instantiate this with a concrete task: when a designated trigger token appears, the model must return the average of all preceding token representations, and otherwise output zero, a task which mirrors the functionality of attention heads in the wild (Barbero et al., 2025; Guo et al., 2024). We also prove that non-normalized ReLU attention can solve the same task without any sink, confirming that the normalization constraint is the fundamental driver of sink behavior. Experiments validate our predictions and demonstrate they extend beyond the theoretically analyzed setting: softmax models develop strong sinks while ReLU attention eliminates them in both single-head and multi-head variants.
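The trigger-conditional task can be illustrated with a small toy sketch. It is not the paper's construction or proof: the attention scores are hand-set rather than learned, and the helper names (`softmax`, `relu_weights`, `attend`), the toy value vectors, and the zero-valued sink token are all assumptions made for illustration. It shows only the intuition stated in the abstract: softmax weights always sum to one, so the output is a convex combination of value vectors and can approach zero only by routing mass to an anchor position, whereas unnormalized ReLU weights can all be zero.

```python
import math

def softmax(scores):
    """Normalized attention weights: non-negative and summing to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def relu_weights(scores):
    """Unnormalized ReLU attention weights: may all be exactly 0."""
    return [max(0.0, s) for s in scores]

def attend(weights, values):
    """Weighted sum of value vectors."""
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]

# Toy representations of two preceding tokens (hypothetical numbers).
values = [[1.0, 2.0], [3.0, 4.0]]
# Assumption: the sink position carries a zero value vector.
sink_value = [0.0, 0.0]

# Softmax without a sink: weights sum to 1, so the output stays inside the
# convex hull of the values and cannot be zero when no trigger is present.
softmax_no_sink = attend(softmax([0.0, 0.0]), values)

# Softmax with a sink: with no trigger, nearly all mass goes to the sink.
softmax_off = attend(softmax([20.0, 0.0, 0.0]), [sink_value] + values)
# With the trigger, the sink is suppressed and the output is close to the mean.
softmax_on = attend(softmax([-20.0, 0.0, 0.0]), [sink_value] + values)

# ReLU without a sink: negative scores give all-zero weights (default state),
# and scores of 1/n on each token give the mean when the trigger is present.
relu_off = attend(relu_weights([-1.0, -1.0]), values)
relu_on = attend(relu_weights([0.5, 0.5]), values)

print("softmax, no sink, no trigger:", softmax_no_sink)
print("softmax, sink, no trigger:   ", softmax_off)
print("softmax, sink, trigger:      ", softmax_on)
print("ReLU, no sink, no trigger:   ", relu_off)
print("ReLU, no sink, trigger:      ", relu_on)
```

In this sketch the softmax head reaches the zero output only approximately and only by concentrating weight on the assumed sink position, while the ReLU head reaches it exactly with no extra position.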
