Autonomy-of-Experts Models

Abstract

Mixture-of-Experts (MoE) models typically use a router to assign tokens to specific expert modules, activating only a subset of parameters and often outperforming dense models. We argue that the separation between the router's decision-making and the experts' execution is a critical yet overlooked issue, leading to suboptimal expert selection and ineffective learning. To address this, we propose Autonomy-of-Experts (AoE), a novel MoE paradigm in which experts autonomously select themselves to process inputs. AoE is based on the insight that an expert is aware of its own capacity to effectively process a token, an awareness reflected in the scale of its internal activations. In AoE, routers are removed; instead, experts pre-compute internal activations for inputs and are ranked by their activation norms. Only the top-ranking experts proceed with the forward pass, while the others abort. The overhead of pre-computing activations is reduced through a low-rank weight factorization. This self-evaluating-then-partner-comparing approach ensures improved expert selection and effective learning. We pre-train language models with 700M to 4B parameters, demonstrating that AoE outperforms traditional MoE models with comparable efficiency.
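
The selection step described above can be illustrated with a minimal sketch in plain Python. It is not the authors' implementation: the layer sizes, the ReLU nonlinearity, the softmax weighting of the selected experts, and every name used here (Expert, aoe_layer, w_down, w_up, w_out) are assumptions made for illustration, since the abstract does not specify them. The sketch only shows the general idea: each expert computes a cheap low-rank activation, experts are ranked by the norm of that activation, and only the top-ranked ones finish the forward pass.

```python
import math
import random


def matvec(x, w):
    """Multiply row vector x (length n) by matrix w (n rows, m columns)."""
    return [sum(xi * row[j] for xi, row in zip(x, w)) for j in range(len(w[0]))]


def rand_matrix(rng, rows, cols, scale=0.1):
    """Small random matrix; the initialisation scale is an arbitrary choice."""
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


class Expert:
    """Hypothetical expert whose first weight is factorised as w_down @ w_up."""

    def __init__(self, rng, d_model, rank, d_hidden):
        self.w_down = rand_matrix(rng, d_model, rank)   # low-rank factor (cheap)
        self.w_up = rand_matrix(rng, rank, d_hidden)    # expands back to hidden size
        self.w_out = rand_matrix(rng, d_hidden, d_model)

    def precompute(self, x):
        # Cheap step that every expert runs: project into the low-rank space.
        return matvec(x, self.w_down)

    def finish(self, z):
        # Expensive step that only selected experts run, reusing the cached z.
        h = [max(0.0, v) for v in matvec(z, self.w_up)]  # ReLU is an assumption
        return matvec(h, self.w_out)


def aoe_layer(x, experts, top_k):
    """Router-free selection: rank experts by the norm of their own activation."""
    cached = [e.precompute(x) for e in experts]
    norms = [math.sqrt(sum(v * v for v in z)) for z in cached]
    chosen = sorted(range(len(experts)), key=lambda i: norms[i], reverse=True)[:top_k]

    # Assumed combination rule: softmax over the selected experts' norms.
    peak = max(norms[i] for i in chosen)
    weights = [math.exp(norms[i] - peak) for i in chosen]
    total = sum(weights)

    y = [0.0] * len(x)
    for i, w in zip(chosen, weights):
        out = experts[i].finish(cached[i])  # unselected experts stop before this
        y = [a + (w / total) * b for a, b in zip(y, out)]
    return y, chosen, norms


rng = random.Random(0)
experts = [Expert(rng, d_model=8, rank=2, d_hidden=16) for _ in range(4)]
token = [rng.gauss(0.0, 1.0) for _ in range(8)]
output, chosen, norms = aoe_layer(token, experts, top_k=2)
print("selected experts:", chosen)
```

In this sketch the cost paid by every expert is a single d_model-by-rank projection, which is how a low-rank factorization can keep the pre-computation overhead small relative to a full expert forward pass.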
