Multilingual Routing in Mixture-of-Experts
October 6, 2025
Authors: Lucas Bandarkar, Chenyuan Yang, Mohsen Fayyaz, Junlin Hu, Nanyun Peng
cs.AI
Abstract
Mixture-of-Experts (MoE) architectures have become the key to scaling modern
LLMs, yet little is understood about how their sparse routing dynamics respond
to multilingual data. In this work, we analyze expert routing patterns using
parallel multilingual datasets and present highly interpretable layer-wise
phenomena. We find that MoE models route tokens in language-specific ways in
the early and late decoder layers but exhibit significant cross-lingual routing
alignment in middle layers, mirroring parameter-sharing trends observed in
dense LLMs. In particular, we reveal a clear, strong correlation between a
model's performance in a given language and how similarly its tokens are routed
to English in these layers. Extending beyond correlation, we explore
inference-time interventions that induce higher cross-lingual routing
alignment. We introduce a method that steers the router by promoting
middle-layer task experts frequently activated in English, and it successfully
increases multilingual performance. These 1-2% gains are remarkably consistent
across two evaluation tasks, three models, and 15+ languages, especially given
that these simple interventions override routers of extensively trained,
state-of-the-art LLMs. In comparison, interventions outside of the middle
layers or targeting multilingual-specialized experts only yield performance
degradation. Altogether, we present numerous findings that explain how MoEs
process non-English text and demonstrate that generalization is limited by the
model's ability to leverage language-universal experts in all languages.
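To make the described intervention concrete, below is a minimal PyTorch sketch of what steering a top-k MoE router toward English-frequent middle-layer experts could look like. It assumes a standard linear gating layer with top-k selection; the promoted expert indices, bias strength, layer choice, and class names are illustrative assumptions, not the authors' actual implementation or configuration.

```python
# Hypothetical sketch of inference-time router steering: in selected middle
# decoder layers, bias the router logits toward a fixed set of experts that
# are frequently activated on English text, then apply the usual top-k gate.
# All concrete values (expert set, bias, top_k) are illustrative assumptions.
import torch
import torch.nn as nn


class SteeredRouter(nn.Module):
    """Wraps a top-k MoE gating layer and promotes a fixed set of experts."""

    def __init__(self, router: nn.Linear, promoted_experts: list[int],
                 bias: float = 1.0, top_k: int = 2):
        super().__init__()
        self.router = router            # original gating projection: hidden -> num_experts
        self.top_k = top_k
        self.bias = bias
        mask = torch.zeros(router.out_features)
        mask[promoted_experts] = 1.0    # 1.0 for experts to promote, 0.0 elsewhere
        self.register_buffer("promote_mask", mask)

    def forward(self, hidden_states: torch.Tensor):
        # hidden_states: (num_tokens, hidden_dim)
        logits = self.router(hidden_states)              # (num_tokens, num_experts)
        logits = logits + self.bias * self.promote_mask  # nudge promoted experts upward
        weights, indices = torch.topk(logits, self.top_k, dim=-1)
        weights = torch.softmax(weights, dim=-1)
        return weights, indices                          # same interface as a plain top-k gate


# Toy usage: 8 experts, promote experts {1, 5} in one hypothetical middle layer.
if __name__ == "__main__":
    base_router = nn.Linear(16, 8, bias=False)
    steered = SteeredRouter(base_router, promoted_experts=[1, 5], bias=1.0, top_k=2)
    tokens = torch.randn(4, 16)
    weights, experts = steered(tokens)
    print(experts)  # promoted experts should now be selected more often
```

In a real model this wrapper would replace the gating layer only in the middle decoder layers, leaving early and late layers (which the paper finds route in language-specific ways) untouched.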