
Multilingual Routing in Mixture-of-Experts

October 6, 2025
Authors: Lucas Bandarkar, Chenyuan Yang, Mohsen Fayyaz, Junlin Hu, Nanyun Peng
cs.AI

Abstract

Mixture-of-Experts (MoE) architectures have become the key to scaling modern LLMs, yet little is understood about how their sparse routing dynamics respond to multilingual data. In this work, we analyze expert routing patterns using parallel multilingual datasets and present highly interpretable layer-wise phenomena. We find that MoE models route tokens in language-specific ways in the early and late decoder layers but exhibit significant cross-lingual routing alignment in middle layers, mirroring parameter-sharing trends observed in dense LLMs. In particular, we reveal a clear, strong correlation between a model's performance in a given language and how similarly its tokens are routed to English in these layers. Extending beyond correlation, we explore inference-time interventions that induce higher cross-lingual routing alignment. We introduce a method that steers the router by promoting middle-layer task experts frequently activated in English, and it successfully increases multilingual performance. These 1-2% gains are remarkably consistent across two evaluation tasks, three models, and 15+ languages, especially given that these simple interventions override routers of extensively trained, state-of-the-art LLMs. In comparison, interventions outside of the middle layers or targeting multilingual-specialized experts only yield performance degradation. Altogether, we present numerous findings that explain how MoEs process non-English text and demonstrate that generalization is limited by the model's ability to leverage language-universal experts in all languages.
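The abstract describes an inference-time intervention that steers the router by promoting middle-layer experts frequently activated on English inputs. Below is a minimal, self-contained sketch of that idea in PyTorch. The `ToyRouter` class, the bias strength `alpha`, and the number of promoted experts `top_n` are illustrative assumptions made for this sketch, not the paper's actual implementation or hyperparameters.

```python
# Sketch of the router-steering idea from the abstract:
# (1) record which experts a middle layer's router activates most often on
#     English inputs, then (2) at inference time on another language, add a
#     small bias toward those experts before top-k selection.
# ToyRouter, alpha, and top_n are illustrative assumptions, not the paper's
# exact method.

import torch
import torch.nn as nn


class ToyRouter(nn.Module):
    """Stand-in for one MoE layer's token router (top-k gating)."""

    def __init__(self, hidden_dim: int, num_experts: int, top_k: int = 2):
        super().__init__()
        self.gate = nn.Linear(hidden_dim, num_experts, bias=False)
        self.top_k = top_k
        self.logit_bias = None  # optional steering bias, set at inference time

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        logits = self.gate(hidden)                      # (tokens, num_experts)
        if self.logit_bias is not None:
            logits = logits + self.logit_bias           # nudge toward promoted experts
        return torch.topk(logits, self.top_k, dim=-1).indices


def english_expert_frequencies(router: ToyRouter, english_hidden: torch.Tensor) -> torch.Tensor:
    """Fraction of English tokens that this layer routes to each expert."""
    with torch.no_grad():
        ids = router(english_hidden).flatten()
    counts = torch.bincount(ids, minlength=router.gate.out_features).float()
    return counts / counts.sum()


def set_steering_bias(router: ToyRouter, freqs: torch.Tensor, alpha: float = 1.0, top_n: int = 4):
    """Promote the top-n most English-active ('task') experts with a fixed logit bias."""
    bias = torch.zeros_like(freqs)
    bias[torch.topk(freqs, top_n).indices] = alpha
    router.logit_bias = bias


if __name__ == "__main__":
    torch.manual_seed(0)
    router = ToyRouter(hidden_dim=64, num_experts=8, top_k=2)

    english_hidden = torch.randn(512, 64)      # hidden states of English tokens
    other_lang_hidden = torch.randn(512, 64)   # hidden states of target-language tokens

    freqs = english_expert_frequencies(router, english_hidden)
    set_steering_bias(router, freqs, alpha=1.0, top_n=4)

    # Routing of non-English tokens is now nudged toward English task experts.
    print(router(other_lang_hidden)[:5])
```

Consistent with the layer-wise findings in the abstract, such a bias would be applied only to the routers of middle decoder layers, leaving the early and late layers, which route tokens in language-specific ways, untouched.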