Multilingual Routing in Mixture-of-Experts

Abstract

Mixture-of-Experts (MoE) architectures have become key to scaling modern LLMs, yet little is understood about how their sparse routing dynamics respond to multilingual data. In this work, we analyze expert routing patterns using parallel multilingual datasets and present highly interpretable layer-wise phenomena. We find that MoE models route tokens in language-specific ways in the early and late decoder layers but exhibit significant cross-lingual routing alignment in the middle layers, mirroring parameter-sharing trends observed in dense LLMs. In particular, we reveal a clear, strong correlation between a model's performance in a given language and how closely the routing of that language's tokens in these middle layers matches the routing of English. Extending beyond correlation, we explore inference-time interventions that induce higher cross-lingual routing alignment. We introduce a method that steers the router by promoting middle-layer task experts that are frequently activated in English, and it successfully increases multilingual performance. These 1-2% gains are remarkably consistent across two evaluation tasks, three models, and 15+ languages, which is notable given that these simple interventions override the routers of extensively trained, state-of-the-art LLMs. In comparison, interventions outside the middle layers, or interventions that target multilingual-specialized experts, yield only performance degradation. Altogether, we present numerous findings that explain how MoEs process non-English text and demonstrate that generalization is limited by the model's ability to leverage language-universal experts in all languages.
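
The two ideas in the abstract, measuring cross-lingual routing alignment and steering the router toward experts favored in English, can be illustrated with a minimal sketch. This is not the authors' implementation. The abstract does not specify the alignment metric or the steering rule, so the sketch assumes a Jaccard overlap of top-k expert sets as the alignment score and a constant additive bias on router logits as the intervention. The names `top_k_experts`, `routing_alignment`, and `steer_router`, and the parameters `k` and `bias`, are hypothetical.

```python
import numpy as np


def top_k_experts(router_logits, k):
    """Return the set of indices of the k largest router logits."""
    return set(np.argsort(router_logits)[-k:].tolist())


def routing_alignment(logits_a, logits_b, k):
    """Mean Jaccard overlap between top-k expert sets of paired inputs.

    ASSUMPTION: Jaccard overlap stands in for the alignment measure, which
    the abstract does not define. Both arrays have shape
    (num_pairs, num_experts), for example router logits of one layer
    averaged over the tokens of each parallel sentence.
    """
    overlaps = []
    for a, b in zip(logits_a, logits_b):
        set_a, set_b = top_k_experts(a, k), top_k_experts(b, k)
        overlaps.append(len(set_a & set_b) / len(set_a | set_b))
    return float(np.mean(overlaps))


def steer_router(router_logits, promoted_experts, bias):
    """Return a copy of the logits with a constant bias added to chosen experts.

    ASSUMPTION: an additive logit bias is one simple way to "promote"
    experts at inference time; the input array is left unmodified.
    """
    steered = np.array(router_logits, dtype=float, copy=True)
    steered[..., sorted(promoted_experts)] += bias
    return steered
```

As a toy usage example, take English logits `[[2.0, 1.0, 0.0, -1.0]]` and logits for another language `[[0.0, 2.0, 1.0, -1.0]]` with `k=2`. The top-2 sets are {0, 1} and {1, 2}, so the alignment is 1/3. After `steer_router(..., {0}, bias=3.0)` the second input also selects {0, 1}, and the alignment becomes 1.0. In a real MoE model such a bias would be applied only in the chosen middle layers, before the router's top-k selection.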
