Redesigning the Mixture-of-Experts Router with Manifold Power Iteration

Abstract

The router is the cornerstone component of Mixture-of-Experts (MoE) models. Serving as expert proxies, the rows of the router matrix compute their similarity to the MoE inputs to determine which subset of experts is activated. Ideally, each router row should condense its expert matrix into a representative vector, so that its dot product with a token better reflects token-expert affinity. However, no design principle currently exists to enforce this condensation. In this paper, we propose to align each router row with the principal singular direction of the associated expert, as this direction provides the most expressive mathematical description of a matrix. Based on this principle, we propose a router redesign with Manifold Power Iteration (MPI). Specifically, MPI introduces a "Power-then-Retract" paradigm, in which a power iteration step is performed on the router weights, followed by a retraction that imposes a norm constraint to ensure both efficiency and stability. Theoretically, we show that MPI drives router rows to converge toward the principal singular directions of their associated experts. Empirically, we pretrain MoE models at scales from 1B to 11B parameters and confirm that this alignment yields more effective MoE models.
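To make the "Power-then-Retract" idea concrete, the following is a minimal sketch under assumptions of our own and is not the MPI update itself. It assumes each expert is summarized by a single matrix W, takes the power step to be r <- W^T W r, and takes the retraction to be a rescaling onto the unit sphere. The helper names (`matvec`, `retract`, `power_then_retract`) and the toy expert are hypothetical and chosen only for illustration.

```python
import math


def matvec(matrix, vec):
    """Return matrix @ vec for a matrix stored as a list of rows."""
    return [sum(m * v for m, v in zip(row, vec)) for row in matrix]


def transpose(matrix):
    """Return the transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*matrix)]


def retract(vec, radius=1.0):
    """Rescale vec onto the sphere of the given radius (the norm constraint)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [radius * x / norm for x in vec]


def power_then_retract(router_row, expert, steps=1):
    """Apply `steps` rounds of r <- retract(W^T W r).

    Assumption: the expert is one matrix W of shape (d_out, d_in) and the
    router row lives in the input space, so the target is the top right
    singular vector of W.
    """
    expert_t = transpose(expert)
    r = list(router_row)
    for _ in range(steps):
        r = matvec(expert_t, matvec(expert, r))  # power step
        r = retract(r)                           # retraction to unit norm
    return r


# Toy expert (hypothetical): singular values 3 and 1, with right singular
# vectors v1 = (0.6, 0.8) and v2 = (-0.8, 0.6).
v1, v2 = [0.6, 0.8], [-0.8, 0.6]
expert = [[3.0 * a for a in v1], [1.0 * a for a in v2]]

row = power_then_retract([1.0, 0.0], expert, steps=10)
alignment = abs(sum(a * b for a, b in zip(row, v1)))  # |cos| of angle to v1
row_norm = math.sqrt(sum(x * x for x in row))
```

In this toy setting each step shrinks the component along v2 relative to the component along v1 by the squared singular-value ratio (here 1/9), so `alignment` approaches 1 while the retraction keeps `row_norm` at 1. In an actual training loop one would presumably interleave a single such step with each gradient update rather than iterate to convergence, but that scheduling is an assumption here.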