Multilingual Routing in Mixture-of-Experts
October 6, 2025
Authors: Lucas Bandarkar, Chenyuan Yang, Mohsen Fayyaz, Junlin Hu, Nanyun Peng
cs.AI
Abstract
Mixture-of-Experts (MoE) architectures have become the key to scaling modern
LLMs, yet little is understood about how their sparse routing dynamics respond
to multilingual data. In this work, we analyze expert routing patterns using
parallel multilingual datasets and present highly interpretable layer-wise
phenomena. We find that MoE models route tokens in language-specific ways in
the early and late decoder layers but exhibit significant cross-lingual routing
alignment in middle layers, mirroring parameter-sharing trends observed in
dense LLMs. In particular, we reveal a clear, strong correlation between a
model's performance in a given language and how similarly its tokens are routed
to English in these layers. Extending beyond correlation, we explore
inference-time interventions that induce higher cross-lingual routing
alignment. We introduce a method that steers the router by promoting
middle-layer task experts frequently activated in English, and it successfully
increases multilingual performance. These 1-2% gains are remarkably consistent
across two evaluation tasks, three models, and 15+ languages, especially given
that these simple interventions override routers of extensively trained,
state-of-the-art LLMs. In comparison, interventions outside of the middle
layers or targeting multilingual-specialized experts only yield performance
degradation. Altogether, we present numerous findings that explain how MoEs
process non-English text and demonstrate that generalization is limited by the
model's ability to leverage language-universal experts in all languages.
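To make the described intervention concrete, below is a minimal PyTorch sketch of what steering a top-k MoE router toward English-frequent middle-layer experts could look like. It assumes a standard linear gating layer with top-k selection; the promoted expert indices, bias strength, layer choice, and class names are illustrative assumptions, not the authors' actual implementation or configuration.

```python
# Hypothetical sketch of inference-time router steering: in selected middle
# decoder layers, bias the router logits toward a fixed set of experts that
# are frequently activated on English text, then apply the usual top-k gate.
# All concrete values (expert set, bias, top_k) are illustrative assumptions.
import torch
import torch.nn as nn


class SteeredRouter(nn.Module):
    """Wraps a top-k MoE gating layer and promotes a fixed set of experts."""

    def __init__(self, router: nn.Linear, promoted_experts: list[int],
                 bias: float = 1.0, top_k: int = 2):
        super().__init__()
        self.router = router            # original gating projection: hidden -> num_experts
        self.top_k = top_k
        self.bias = bias
        mask = torch.zeros(router.out_features)
        mask[promoted_experts] = 1.0    # 1.0 for experts to promote, 0.0 elsewhere
        self.register_buffer("promote_mask", mask)

    def forward(self, hidden_states: torch.Tensor):
        # hidden_states: (num_tokens, hidden_dim)
        logits = self.router(hidden_states)              # (num_tokens, num_experts)
        logits = logits + self.bias * self.promote_mask  # nudge promoted experts upward
        weights, indices = torch.topk(logits, self.top_k, dim=-1)
        weights = torch.softmax(weights, dim=-1)
        return weights, indices                          # same interface as a plain top-k gate


# Toy usage: 8 experts, promote experts {1, 5} in one hypothetical middle layer.
if __name__ == "__main__":
    base_router = nn.Linear(16, 8, bias=False)
    steered = SteeredRouter(base_router, promoted_experts=[1, 5], bias=1.0, top_k=2)
    tokens = torch.randn(4, 16)
    weights, experts = steered(tokens)
    print(experts)  # promoted experts should now be selected more often
```

In a real model this wrapper would replace the gating layer only in the middle decoder layers, leaving early and late layers (which the paper finds route in language-specific ways) untouched.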