Autonomy-of-Experts Models
January 22, 2025
Authors: Ang Lv, Ruobing Xie, Yining Qian, Songhao Wu, Xingwu Sun, Zhanhui Kang, Di Wang, Rui Yan
cs.AI
Abstract
Mixture-of-Experts (MoE) models mostly use a router to assign tokens to
specific expert modules, activating only partial parameters and often
outperforming dense models. We argue that the separation between the router's
decision-making and the experts' execution is a critical yet overlooked issue,
leading to suboptimal expert selection and ineffective learning. To address
this, we propose Autonomy-of-Experts (AoE), a novel MoE paradigm in which
experts autonomously select themselves to process inputs. AoE is based on the
insight that an expert is aware of its own capacity to effectively process a
token, an awareness reflected in the scale of its internal activations. In AoE,
routers are removed; instead, experts pre-compute internal activations for
inputs and are ranked based on their activation norms. Only the top-ranking
experts proceed with the forward pass, while the others abort. The overhead of
pre-computing activations is reduced through a low-rank weight factorization.
This self-evaluating-then-partner-comparing approach ensures improved expert
selection and effective learning. We pre-train language models ranging from
700M to 4B parameters, demonstrating that AoE outperforms traditional MoE models
with comparable efficiency.