

Softpick: No Attention Sink, No Massive Activations with Rectified Softmax

April 29, 2025
Authors: Zayd M. K. Zuhri, Erland Hilman Fuadi, Alham Fikri Aji
cs.AI

Abstract

We introduce softpick, a rectified, not sum-to-one, drop-in replacement for softmax in transformer attention mechanisms that eliminates attention sink and massive activations. Our experiments with 340M parameter models demonstrate that softpick maintains performance parity with softmax on standard benchmarks while achieving a 0% sink rate. The softpick transformer produces hidden states with significantly lower kurtosis (340 vs 33,510) and creates sparse attention maps (46.97% sparsity). Models using softpick consistently outperform softmax when quantized, with particularly pronounced advantages at lower bit precisions. Our analysis and discussion show how softpick has the potential to open new possibilities for quantization, low-precision training, sparsity optimization, pruning, and interpretability. Our code is available at https://github.com/zaydzuhri/softpick-attention.
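To make the "rectified, not sum-to-one" idea concrete, below is a minimal PyTorch sketch of such a replacement, assuming a form that rectifies exp(x) − 1 in the numerator and normalizes by the sum of |exp(x) − 1|; the function name `softpick_sketch` and the epsilon value are illustrative assumptions, and the authoritative definition and numerically stable implementation are the ones in the linked repository.

```python
import torch
import torch.nn.functional as F

def softpick_sketch(scores: torch.Tensor, dim: int = -1, eps: float = 1e-6) -> torch.Tensor:
    """Illustrative rectified, non-sum-to-one softmax variant (not the paper's exact code).

    Rectifying exp(x) - 1 lets rows assign zero weight everywhere instead of being
    forced to sum to one, which is how a sink token would otherwise absorb mass.
    """
    e = torch.exp(scores) - 1.0          # negative scores give values in (-1, 0)
    num = F.relu(e)                      # rectify: only positive scores contribute
    den = e.abs().sum(dim=dim, keepdim=True) + eps
    return num / den                     # rows may sum to less than one

# Drop-in use inside scaled dot-product attention (shapes: batch, heads, seq, head_dim).
q, k, v = (torch.randn(2, 4, 8, 16) for _ in range(3))
attn = softpick_sketch(q @ k.transpose(-2, -1) / 16 ** 0.5)
out = attn @ v
```

Note that this sketch omits the rescaling needed for numerical stability with large score magnitudes (the usual max-subtraction trick must be adapted because the function is not shift-invariant); the released code should be consulted for the production formulation.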