Confidence-Adaptive SwiGLU for Mixture of Experts

Abstract

SwiGLU has become a standard gated activation in modern Transformer MLPs, yet its gate sharpness (the smoothness and selectivity of the gating function) is typically fixed throughout training. In this work, we propose Confidence-Aware SwiGLU (κ-SwiGLU), a SwiGLU variant for Mixture-of-Experts (MoE) models that adjusts expert gate sharpness according to token-level routing confidence. Specifically, κ-SwiGLU parameterizes the SiLU gate sharpness coefficient as a learnable function of the router logit, so that each expert gate unit can interpolate between smooth, broadly active gating and sharp, selective gating. We evaluate κ-SwiGLU on the FineWeb-Edu dataset with MoE Transformer models ranging from 8 to 28 layers. Across these settings, κ-SwiGLU improves mean CORE performance while adding a negligible number of parameters and incurring only a small computational overhead, indicating that confidence-aware gate sharpness is a promising mechanism for improving MoE MLPs. The code is available at https://github.com/askerlee/kappa-swiglu.
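To make the mechanism concrete, the following is a minimal sketch of a single expert MLP whose SiLU gate sharpness depends on the router logit. It is an illustration under stated assumptions, not the implementation in the linked repository. The abstract does not give the functional form, so the sketch assumes a per-unit affine map followed by softplus, β_j = softplus(a_j · r + b_j), where r is the token's router logit for this expert. The names `gate_sharpness`, `sharp_silu`, `kappa_swiglu_expert`, `a`, and `b` are hypothetical.

```python
import math


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def softplus(z):
    # Stable log(1 + exp(z)); keeps the sharpness coefficient positive.
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def sharp_silu(z, beta):
    # SiLU with sharpness beta: z * sigmoid(beta * z).
    # beta = 1 is the usual SiLU; large beta approaches ReLU;
    # beta near 0 approaches the linear map z / 2.
    return z * sigmoid(beta * z)


def gate_sharpness(router_logit, a_j, b_j):
    # ASSUMPTION: affine-plus-softplus form, chosen only for illustration.
    return softplus(a_j * router_logit + b_j)


def matvec(W, x):
    # Plain-Python matrix-vector product (W is a list of rows).
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def kappa_swiglu_expert(x, W_gate, W_up, W_down, router_logit, a, b):
    # One expert: down( sharp_silu(gate(x), beta) * up(x) ),
    # with one learnable (a_j, b_j) pair per hidden gate unit.
    gate_pre = matvec(W_gate, x)
    up = matvec(W_up, x)
    hidden = []
    for g, u, a_j, b_j in zip(gate_pre, up, a, b):
        beta = gate_sharpness(router_logit, a_j, b_j)
        hidden.append(sharp_silu(g, beta) * u)
    return matvec(W_down, hidden)


if __name__ == "__main__":
    x = [0.5, -1.0]
    W_gate = [[1.0, 0.0], [0.0, 1.0]]
    W_up = [[1.0, 1.0], [1.0, -1.0]]
    W_down = [[1.0, 1.0]]
    a = [1.0, 1.0]   # hypothetical sharpness slopes
    b = [0.0, 0.0]   # hypothetical sharpness offsets
    for r in (-2.0, 0.0, 2.0):
        print(r, kappa_swiglu_expert(x, W_gate, W_up, W_down, r, a, b))
```

In this sketch the extra parameters amount to two scalars per hidden gate unit, which is in line with the negligible parameter cost stated above. With a_j = 0 and b_j = log(e − 1), the sharpness is exactly 1 and the expert reduces to standard SwiGLU.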