Redesigning the Mixture-of-Experts Router with Manifold Power Iteration

Abstract

The router is a cornerstone component of Mixture-of-Experts (MoE) models. Each row of the router matrix serves as a proxy for one expert: its similarity to the MoE input determines which subset of experts is activated. Ideally, each router row should condense its expert matrix into a single representative vector, so that its dot product with a token better reflects token-expert affinity. However, no existing design principle enforces this condensation. In this paper, we propose aligning each router row with the principal singular direction of its associated expert, since this direction provides the most expressive mathematical description of a matrix. Based on this principle, we propose a router redesign built on Manifold Power Iteration (MPI). Specifically, MPI introduces a "Power-then-Retract" paradigm: a power iteration step is performed on the router weights, followed by a retraction that imposes a norm constraint, ensuring both efficiency and stability. Theoretically, we show that MPI drives router rows to converge toward the principal singular directions of their associated experts. Empirically, we pretrain MoE models at scales from 1B to 11B parameters and confirm that this alignment yields more effective MoE models.
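The "Power-then-Retract" idea can be illustrated with a small NumPy sketch. This is not the paper's implementation: it assumes each expert is summarized by a single matrix `W` of shape `(d_model, d_ff)`, that the power step multiplies the router row by `W @ W.T`, and that the retraction is a rescaling onto a sphere of fixed radius. The function name `power_then_retract`, the `radius` parameter, and the toy dimensions are all hypothetical.

```python
import numpy as np


def power_then_retract(router_row, expert_matrix, radius=1.0):
    """One assumed MPI-style update for a single router row.

    Power step: multiply by W W^T, which amplifies the component along
    the principal left singular direction of W.
    Retraction: rescale back onto the sphere of the given radius.
    """
    powered = expert_matrix @ (expert_matrix.T @ router_row)
    norm = np.linalg.norm(powered)
    if norm == 0.0:
        # Degenerate case: leave the row unchanged rather than divide by zero.
        return router_row
    return radius * powered / norm


# Toy expert matrix with a clear gap between its first two singular values.
rng = np.random.default_rng(0)
d_model, d_ff = 16, 32
U, _ = np.linalg.qr(rng.standard_normal((d_model, d_model)))
V, _ = np.linalg.qr(rng.standard_normal((d_ff, d_model)))
singular_values = np.concatenate(([5.0], np.linspace(2.0, 0.5, d_model - 1)))
W = U @ np.diag(singular_values) @ V.T  # shape (d_model, d_ff)

# Start from a random router row and apply the update repeatedly.
r = rng.standard_normal(d_model)
for _ in range(20):
    r = power_then_retract(r, W)

# Compare with the principal left singular vector from a full SVD.
top_left = np.linalg.svd(W)[0][:, 0]
alignment = abs(r @ top_left)  # |cosine|, since both vectors have unit norm
print(f"alignment with principal singular direction: {alignment:.6f}")
```

In this toy setting the update is applied in isolation. How the power step is interleaved with gradient updates during training, and which expert matrix it uses, are details the abstract does not specify.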