Confidence-Aware SwiGLU for Mixture-of-Experts

Abstract

SwiGLU has become a standard gated activation in modern Transformer MLPs, yet its gate sharpness (the smoothness and selectivity of the gating function) is typically fixed throughout training. In this work, we propose Confidence-Aware SwiGLU (κ-SwiGLU), a variant of SwiGLU for Mixture-of-Experts (MoE) models that adjusts expert gate sharpness according to token-level routing confidence. Specifically, κ-SwiGLU parameterizes the sharpness coefficient of the SiLU gate as a learnable function of the router logit, so that each expert gate unit can interpolate between smooth, broadly active gating and sharp, selective gating. We evaluate κ-SwiGLU on the FineWeb-Edu dataset across MoE Transformer models ranging from 8 to 28 layers. Across these settings, κ-SwiGLU improves mean CORE performance while adding a negligible number of parameters and incurring only a small computational overhead, indicating that confidence-aware gate sharpness is a promising mechanism for improving MoE MLPs. The code is available at https://github.com/askerlee/kappa-swiglu.
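
To make the idea of a logit-dependent gate sharpness concrete, the following is a minimal sketch for a single gate unit, not the authors' implementation. It assumes the sharpness is obtained as beta = softplus(a * router_logit + b), where `a` and `b` are hypothetical learnable scalars; the abstract only states that the sharpness is a learnable function of the router logit, so this functional form, the function names, and the default values are all illustrative assumptions. With `b = log(e - 1)` and a router logit of zero, beta equals 1 and the unit reduces to the ordinary SiLU-gated product.

```python
import math


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def softplus(z: float) -> float:
    """Numerically stable softplus, log(1 + exp(z)); always positive."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


# Hypothetical default offset chosen so that softplus(B_INIT) == 1.
B_INIT = math.log(math.e - 1.0)


def sharpness(router_logit: float, a: float = 1.0, b: float = B_INIT) -> float:
    """Assumed parameterization: beta = softplus(a * router_logit + b).

    `a` and `b` stand in for learnable parameters; they are illustrative
    and not taken from the paper.
    """
    return softplus(a * router_logit + b)


def kappa_swiglu_unit(gate_pre: float, value_pre: float, router_logit: float,
                      a: float = 1.0, b: float = B_INIT) -> float:
    """One hidden unit of a gated MLP with logit-dependent gate sharpness.

    gate_pre and value_pre are the two pre-activations of a SwiGLU-style
    unit. A small beta gives a smooth, broadly active gate; a large beta
    pushes the gate toward a sharp, ReLU-like one.
    """
    beta = sharpness(router_logit, a, b)
    return gate_pre * sigmoid(beta * gate_pre) * value_pre
```

Under these assumptions, a confidently routed token (large router logit) sees a sharper gate than a weakly routed one, which is one simple way to realize the interpolation described above. A full layer would apply this elementwise to the gate and value projections of each expert.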