Confidence-Aware SwiGLU for Mixture-of-Experts Models

Abstract

SwiGLU has become a standard gated activation in modern Transformer MLPs, yet its gate sharpness (the smoothness and selectivity of the gating function) is typically fixed throughout training. In this work, we propose Confidence-Aware SwiGLU (κ-SwiGLU), a variant of SwiGLU for Mixture-of-Experts (MoE) models that adjusts expert gate sharpness according to token-level routing confidence. Specifically, κ-SwiGLU parameterizes the SiLU gate sharpness coefficient as a learnable function of the router logit, so that each expert gate unit can interpolate between smooth, broadly active gating and sharp, selective gating. We evaluate κ-SwiGLU on the FineWeb-Edu dataset with MoE Transformer models ranging from 8 to 28 layers. Across these settings, κ-SwiGLU improves mean CORE performance while adding a negligible number of parameters and only a small computational overhead, suggesting that confidence-aware gate sharpness is a promising mechanism for improving MoE MLPs. The code is available at https://github.com/askerlee/kappa-swiglu.
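To make the mechanism concrete, the following is a minimal sketch of one expert's forward pass for a single token, not the authors' implementation. The abstract states only that the SiLU sharpness coefficient is a learnable function of the router logit; the specific form used here, beta = softplus(a * logit + b), the scalar parameters `a` and `b`, and all function names are assumptions chosen for illustration.

```python
import math

# softplus(B_UNIT) == 1, so a = 0, b = B_UNIT recovers the standard SiLU gate.
B_UNIT = math.log(math.e - 1.0)

def sigmoid(v):
    # Numerically stable logistic function.
    if v >= 0:
        return 1.0 / (1.0 + math.exp(-v))
    e = math.exp(v)
    return e / (1.0 + e)

def softplus(v):
    # log(1 + exp(v)), computed stably; the result is always positive.
    return max(v, 0.0) + math.log1p(math.exp(-abs(v)))

def sharpness(router_logit, a, b):
    # ASSUMPTION: hypothetical parameterization of the gate sharpness.
    # a and b stand in for the learnable parameters; the paper may use
    # a different function of the router logit.
    return softplus(a * router_logit + b)

def matvec(W, x):
    # Dense matrix-vector product on plain Python lists.
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def kappa_swiglu_expert(x, router_logit, W_gate, W_up, W_down, a, b):
    # One expert MLP: W_down @ (swish_beta(W_gate @ x) * (W_up @ x)),
    # where swish_beta(g) = g * sigmoid(beta * g) and beta depends on
    # the token's router logit for this expert.
    beta = sharpness(router_logit, a, b)
    g = matvec(W_gate, x)
    u = matvec(W_up, x)
    h = [gi * sigmoid(beta * gi) * ui for gi, ui in zip(g, u)]
    return matvec(W_down, h)
```

Under these assumptions, setting `a = 0` and `b = B_UNIT` fixes beta at 1 and reduces the expert to ordinary SwiGLU, while a large beta pushes the gate toward a hard, ReLU-like selection. With `a > 0`, tokens routed with higher logits receive sharper gates.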