ChatPaper.ai

Attention Sinks Are Provably Necessary in Softmax Transformers: Evidence from Trigger-Conditional Tasks

March 12, 2026
Author: Yuval Ran-Milo
cs.AI

Abstract

Transformers often display an attention sink: probability mass concentrates on a fixed, content-agnostic position. We prove that computing a simple trigger-conditional behavior necessarily induces a sink in softmax self-attention models. Our results formalize a familiar intuition: normalization over a probability simplex must force attention to collapse onto a stable anchor to realize a default state (e.g., when the model needs to ignore the input). We instantiate this with a concrete task: when a designated trigger token appears, the model must return the average of all preceding token representations, and otherwise output zero, a task which mirrors the functionality of attention heads in the wild (Barbero et al., 2025; Guo et al., 2024). We also prove that non-normalized ReLU attention can solve the same task without any sink, confirming that the normalization constraint is the fundamental driver of sink behavior. Experiments validate our predictions and demonstrate they extend beyond the theoretically analyzed setting: softmax models develop strong sinks while ReLU attention eliminates them in both single-head and multi-head variants.
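The normalization argument in the abstract can be illustrated with a minimal NumPy sketch (not the paper's construction; all names and values here are illustrative). Softmax weights always sum to 1, so a head that must "output zero by default" cannot do so by suppressing all scores; it can only approximate zero by dumping its mass onto an anchor position whose value vector is near zero, i.e., a sink. Unnormalized ReLU attention has no such constraint and can output exactly zero.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(0)
T, d = 5, 4
values = rng.normal(size=(T, d))      # value vectors of preceding tokens

# Softmax attention: weights form a point on the probability simplex,
# so the output is a convex combination of values and cannot be zero
# for generic values, even if every score is pushed very low.
scores = np.full(T, -10.0)            # "ignore everything" intent
w = softmax(scores)                   # still sums to 1

# To realize a near-zero default, mass must collapse onto a sink
# position with a ~zero value vector (a hypothetical BOS-like anchor).
vals_with_sink = np.vstack([np.zeros(d), values])
sink_scores = np.concatenate([[10.0], np.full(T, -10.0)])
out_softmax = softmax(sink_scores) @ vals_with_sink   # approximately zero

# Unnormalized ReLU attention: negative scores clip to exactly zero
# weight, so the head outputs exactly zero with no sink at all.
relu_w = np.maximum(np.full(T, -1.0), 0.0)
out_relu = relu_w @ values            # exactly zero
```

In this toy setting the softmax head reaches its default state only via the sink row, while the ReLU head simply zeroes its weights, mirroring the dichotomy the paper proves.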
March 15, 2026