Redesigning Mixture-of-Experts Routers with Manifold Power Iteration

Abstract

The router is a cornerstone component of Mixture-of-Experts (MoE) models. Serving as proxies for the experts, the rows of the router matrix compute their similarity to the MoE inputs to determine which subset of experts is activated. Ideally, each router row condenses its expert's weight matrix into a single representative vector, so that its dot product with a token better reflects token-expert affinity. However, there is currently no design principle that enforces this condensation. In this paper, we propose aligning each router row with the principal singular direction of its associated expert, since this direction provides the most expressive mathematical description of a matrix. Based on this principle, we propose a router redesign built on Manifold Power Iteration (MPI). Specifically, MPI introduces a "Power-then-Retract" paradigm: a power iteration step is performed on the router weights, followed by a retraction that imposes a norm constraint, which ensures both efficiency and stability. Theoretically, we show that MPI drives the router rows to converge toward the principal singular directions of their associated experts. Empirically, we pretrain MoE models at scales from 1B to 11B parameters and confirm that this alignment yields more effective MoE models.
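
The "Power-then-Retract" idea can be illustrated with a minimal NumPy sketch. The abstract does not state the exact update rule, so everything below is an assumption made for illustration: the power step multiplies a router row by W^T W for a single expert weight matrix W, and the retraction rescales the result to a fixed norm. The names `route_top_k`, `power_then_retract`, `make_expert`, and `radius` are hypothetical and do not come from the paper, and the sketch ignores how such a step would be interleaved with gradient-based training.

```python
import numpy as np


def route_top_k(tokens, router, k):
    """Score each token against every router row and pick the top-k experts.

    tokens: (n_tokens, d_model), router: (n_experts, d_model).
    Returns an (n_tokens, k) array of expert indices, best first.
    """
    scores = tokens @ router.T  # dot-product affinity per (token, expert)
    return np.argsort(-scores, axis=-1)[:, :k]


def power_then_retract(router_row, expert_weight, radius=1.0):
    """One illustrative power-then-retract step (assumed form, not the paper's).

    Power step: apply W^T W, which amplifies the component of the row along
    the top right singular vector of W.
    Retract step: rescale back to norm `radius` so the row stays bounded.
    """
    powered = expert_weight.T @ (expert_weight @ router_row)
    norm = np.linalg.norm(powered)
    if norm == 0.0:
        # Degenerate case: the row lies in the null space of W; leave it as is.
        return router_row
    return radius * powered / norm


def make_expert(d_ff, d_model, rng):
    """Build a toy expert matrix (d_ff x d_model) with a clear spectral gap."""
    u, _ = np.linalg.qr(rng.standard_normal((d_ff, d_model)))
    v, _ = np.linalg.qr(rng.standard_normal((d_model, d_model)))
    s = np.concatenate(([5.0], np.linspace(2.0, 1.0, d_model - 1)))
    return (u * s) @ v.T


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    W = make_expert(16, 8, rng)
    row = rng.standard_normal(8)
    for _ in range(30):
        row = power_then_retract(row, W)
    top_right = np.linalg.svd(W)[2][0]  # principal right singular vector
    print("abs cosine with principal direction:", abs(row @ top_right))
```

On a toy matrix with a large gap between its first and second singular values, repeating the step pulls the row toward the principal right singular direction (up to sign) while the retraction keeps its norm fixed. Which expert matrix is used and which norm constraint is imposed in the actual method are not specified in the abstract.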