Softpick: No Attention Sink, No Massive Activations with Rectified Softmax
April 29, 2025
Authors: Zayd M. K. Zuhri, Erland Hilman Fuadi, Alham Fikri Aji
cs.AI
Abstract
We introduce softpick, a rectified, not sum-to-one, drop-in replacement for
softmax in transformer attention mechanisms that eliminates attention sink and
massive activations. Our experiments with 340M parameter models demonstrate
that softpick maintains performance parity with softmax on standard benchmarks
while achieving 0% sink rate. The softpick transformer produces hidden states
with significantly lower kurtosis (340 vs 33,510) and creates sparse attention
maps (46.97% sparsity). Models using softpick consistently outperform softmax
when quantized, with particularly pronounced advantages at lower bit
precisions. Our analysis and discussion show how softpick has the potential to
open new possibilities for quantization, low-precision training, sparsity
optimization, pruning, and interpretability. Our code is available at
https://github.com/zaydzuhri/softpick-attention.
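
For intuition, below is a minimal sketch of what a rectified, not-sum-to-one replacement for softmax over attention scores can look like in PyTorch. It illustrates the idea described in the abstract and is not the authors' reference implementation (see the linked repository for that); the specific rectified form, the epsilon, and the max-subtraction stabilization are assumptions made for this sketch.

```python
# Minimal sketch of a rectified, not-sum-to-one softmax replacement for
# attention scores. Illustrative only; the exact rectified form, the epsilon,
# and the stabilization are assumptions, not the paper's reference code.
import torch


def softpick_like(scores: torch.Tensor, dim: int = -1, eps: float = 1e-6) -> torch.Tensor:
    # Subtract the per-row maximum so the exponentials stay in a safe range.
    m = scores.amax(dim=dim, keepdim=True)
    shifted = torch.exp(scores - m) - torch.exp(-m)  # proportional to exp(x) - 1
    # Rectify the numerator: low scores get exactly zero weight, so a row can
    # sum to less than one instead of being forced to dump probability mass
    # onto a "sink" token.
    num = torch.relu(shifted)
    den = shifted.abs().sum(dim=dim, keepdim=True) + eps
    return num / den


# Usage on a toy row of attention scores: the resulting weights are
# non-negative, sparse, and need not sum to one.
scores = torch.tensor([[2.0, -1.0, 0.5, -3.0]])
weights = softpick_like(scores)
print(weights, weights.sum(dim=-1))
```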