Redesigning the Mixture-of-Experts Router via Manifold Power Iteration

Abstract

The router is the cornerstone of Mixture-of-Experts (MoE) models. Serving as expert proxies, the rows of the router matrix compute their similarity to the MoE inputs to determine which subset of experts is activated. Ideally, each router row should condense its expert matrix into a representative vector, so that the row's dot product with a token better reflects token-expert affinity. However, no design principle exists to enforce this condensation. In this paper, we propose to align each router row with the principal singular direction of the associated expert, as this direction provides the most expressive mathematical description of a matrix. Based on this principle, we propose a router redesign with Manifold Power Iteration (MPI). Specifically, MPI introduces a "Power-then-Retract" paradigm: a power iteration step is performed on the router weights, followed by a retraction that imposes a norm constraint to ensure both efficiency and stability. Theoretically, we show that MPI drives router rows to converge toward the principal singular directions of their associated experts. Empirically, we pretrain MoE models at scales from 1B to 11B parameters and confirm that this alignment yields more effective MoE models.
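
The "Power-then-Retract" idea can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: it assumes a single expert represented by one matrix W of shape (d_hidden, d_model), uses W^T W as the power operator so that the router row lives in token space, and retracts by rescaling onto a sphere. The function name `power_then_retract`, the `radius` parameter, and the synthetic expert with a known spectrum are all hypothetical choices made for illustration.

```python
import numpy as np


def power_then_retract(router_row, expert_matrix, radius=1.0):
    """One illustrative Power-then-Retract update for a single router row.

    Assumption: expert_matrix has shape (d_hidden, d_model) and
    router_row has shape (d_model,).
    """
    # Power step: apply W^T W without materialising the Gram matrix.
    powered = expert_matrix.T @ (expert_matrix @ router_row)
    # Retraction: rescale onto the sphere of the given radius (norm constraint).
    norm = np.linalg.norm(powered)
    if norm == 0.0:
        return router_row  # degenerate case: leave the row unchanged
    return radius * powered / norm


rng = np.random.default_rng(0)
d_model, d_hidden = 16, 32

# Synthetic expert with known singular vectors and a clear spectral gap.
U, _ = np.linalg.qr(rng.standard_normal((d_hidden, d_model)))
V, _ = np.linalg.qr(rng.standard_normal((d_model, d_model)))
singular_values = 2.0 ** -np.arange(d_model)
expert = U @ np.diag(singular_values) @ V.T
top_direction = V[:, 0]  # principal right singular direction

# Start from a random unit-norm router row and iterate.
row = rng.standard_normal(d_model)
row /= np.linalg.norm(row)
for _ in range(50):
    row = power_then_retract(row, expert)

# Absolute cosine similarity with the principal direction (sign is arbitrary).
alignment = abs(row @ top_direction)
print(f"alignment with principal singular direction: {alignment:.6f}")
```

In this toy setting the row stays on the unit sphere after every update and its alignment with the principal right singular direction approaches 1. In actual training the expert weights change at every step, so this update would presumably be interleaved with gradient updates rather than run to convergence on a fixed matrix.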