Softpick: No Attention Sink, No Massive Activations with Rectified Softmax

Abstract

We introduce softpick, a rectified, not sum-to-one, drop-in replacement for softmax in transformer attention mechanisms that eliminates attention sink and massive activations. Our experiments with 340M parameter models demonstrate that softpick maintains performance parity with softmax on standard benchmarks while achieving a 0% sink rate. The softpick transformer produces hidden states with significantly lower kurtosis (340 vs 33,510) and creates sparse attention maps (46.97% sparsity). Models using softpick consistently outperform softmax when quantized, with particularly pronounced advantages at lower bit precisions. Our analysis and discussion show how softpick has the potential to open new possibilities for quantization, low-precision training, sparsity optimization, pruning, and interpretability. Our code is available at https://github.com/zaydzuhri/softpick-attention.
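The abstract does not spell out the function itself, so the following is only a minimal sketch of what "rectified, not sum-to-one" could look like for a single row of attention scores. It assumes the form ReLU(e^x - 1) / (sum of |e^x - 1| + eps); that formula, the `eps` value, the shift used for numerical stability, and the helper names are assumptions made for illustration, not taken from this page. The linked repository is the authoritative implementation.

```python
import math


def softpick(scores, eps=1e-8):
    """Sketch of a rectified, not sum-to-one softmax replacement (assumed form)."""
    # Shift by max(scores), but never below 0, so every exponent is <= 0
    # and math.exp cannot overflow. This shift is an illustrative choice.
    m = max(max(scores), 0.0)
    # exp(x - m) - exp(-m) is (exp(x) - 1) scaled by exp(-m);
    # the common scale cancels in the ratio below (up to eps).
    shifted = [math.exp(x - m) - math.exp(-m) for x in scores]
    # The denominator uses absolute values, so negative-score tokens still
    # contribute to normalization even though their own weight is zero.
    denom = sum(abs(s) for s in shifted) + eps
    # Rectification: any score <= 0 receives exactly zero weight.
    return [max(s, 0.0) / denom for s in shifted]


def softmax(scores):
    """Standard softmax, for comparison."""
    m = max(scores)
    exps = [math.exp(x - m) for x in scores]
    total = sum(exps)
    return [e / total for e in exps]


row = [2.0, 0.0, -1.0, 1.0]
pick_weights = softpick(row)
soft_weights = softmax(row)

# Fraction of exactly-zero weights in this row (a toy notion of sparsity).
row_sparsity = sum(1 for w in pick_weights if w == 0.0) / len(pick_weights)

# A row whose scores are all negative gets all-zero weights under this sketch.
all_negative = softpick([-1.0, -2.0, -3.0])
```

Under these assumptions, the softpick weights for `row` contain exact zeros and sum to less than 1, while the softmax weights are all strictly positive and sum to 1. Because a row is allowed to place no weight anywhere, there is no leftover probability mass that has to be placed on some token, which is one way to read the abstract's claim about eliminating attention sink.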
