Confidence-Adaptive SwiGLU for Mixture-of-Experts

Abstract

SwiGLU has become a standard gated activation in modern Transformer MLPs, yet its gate sharpness (the smoothness and selectivity of the gating function) is typically fixed throughout training. In this work, we propose Confidence-Aware SwiGLU (κ-SwiGLU), a variant of SwiGLU for Mixture-of-Experts (MoE) models that adjusts expert gate sharpness according to token-level routing confidence. Specifically, κ-SwiGLU parameterizes the SiLU gate sharpness coefficient as a learnable function of the router logit, enabling each expert gate unit to interpolate between smooth, broadly active gating and sharp, selective gating. We evaluate κ-SwiGLU on the FineWeb-Edu dataset across MoE Transformer models ranging from 8 to 28 layers. Across these settings, κ-SwiGLU improves mean CORE performance while adding a negligible number of parameters and incurring only a small computational overhead, suggesting that confidence-aware gate sharpness is a promising mechanism for improving MoE MLPs. The code is available at https://github.com/askerlee/kappa-swiglu.
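
To make the mechanism concrete, the following is a minimal sketch of a sharpness-modulated SwiGLU for a single token routed to a single expert. The abstract does not specify how the sharpness coefficient depends on the router logit, so the normalized softplus form below, the scalar parameters `a` and `b`, and all function names are illustrative assumptions, not the authors' parameterization; the released code should be consulted for the actual definition.

```python
import math


def sigmoid(z: float) -> float:
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def softplus(z: float) -> float:
    # Stable log(1 + exp(z)); strictly positive for every finite z.
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def sharp_silu(x: float, kappa: float) -> float:
    # SiLU with sharpness kappa: x * sigmoid(kappa * x).
    # kappa = 1 is the standard SiLU; large kappa approaches ReLU,
    # and kappa = 0 reduces to the linear map x / 2.
    return x * sigmoid(kappa * x)


def kappa_from_logit(router_logit: float, a: float, b: float) -> float:
    # ASSUMPTION: a hypothetical learnable map from router logit to sharpness.
    # Dividing by log(2) makes a = b = 0 recover kappa = 1 (plain SwiGLU).
    return softplus(a * router_logit + b) / math.log(2.0)


def kappa_swiglu(gate_pre, up_pre, router_logit, a, b):
    # gate_pre, up_pre: pre-activations of the expert's gate and up projections
    # for one token (lists of floats of equal length).
    kappa = kappa_from_logit(router_logit, a, b)
    return [sharp_silu(g, kappa) * u for g, u in zip(gate_pre, up_pre)]
```

With `a > 0`, a token routed with a higher logit receives a larger sharpness coefficient and therefore a more selective, ReLU-like gate, while a low-confidence token receives a smoother gate. Initializing `a = b = 0` makes this sketch start out identical to a standard SwiGLU expert.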