Multilingual Routing in Mixture-of-Experts

Abstract

Mixture-of-Experts (MoE) architectures have become the key to scaling modern LLMs, yet little is understood about how their sparse routing dynamics respond to multilingual data. In this work, we analyze expert routing patterns using parallel multilingual datasets and present highly interpretable layer-wise phenomena. We find that MoE models route tokens in language-specific ways in the early and late decoder layers but exhibit significant cross-lingual routing alignment in the middle layers, mirroring parameter-sharing trends observed in dense LLMs. In particular, we reveal a clear, strong correlation between a model's performance in a given language and how similarly that language's tokens are routed to English tokens in these middle layers. Extending beyond correlation, we explore inference-time interventions that induce higher cross-lingual routing alignment. We introduce a method that steers the router by promoting middle-layer task experts frequently activated in English, and it successfully increases multilingual performance. These 1-2% gains are remarkably consistent across two evaluation tasks, three models, and 15+ languages, especially given that these simple interventions override the routers of extensively trained, state-of-the-art LLMs. In comparison, interventions outside of the middle layers or targeting multilingual-specialized experts only yield performance degradation. Altogether, we present numerous findings that explain how MoEs process non-English text and demonstrate that generalization is limited by the model's ability to leverage language-universal experts in all languages.
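
As a rough illustration of the two ideas summarized above (measuring cross-lingual routing alignment, then steering the router toward experts favored in English), the toy sketch below may help. It is not the method of this work: the Jaccard overlap of top-k expert sets, the additive logit bonus, and every name in it (`top_k_experts`, `routing_alignment`, `steer`, the example logits) are assumptions chosen only to keep the example small.

```python
def top_k_experts(logits, k):
    """Return the set of indices of the k largest router logits."""
    order = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    return set(order[:k])


def routing_alignment(logits_a, logits_b, k):
    """Jaccard overlap of the top-k expert sets for two parallel inputs.

    This overlap measure is an assumed stand-in for a routing-alignment metric.
    """
    a = top_k_experts(logits_a, k)
    b = top_k_experts(logits_b, k)
    return len(a & b) / len(a | b)


def steer(logits, promoted, bonus):
    """Add a fixed bonus to the logits of promoted experts (hypothetical form)."""
    return [x + bonus if i in promoted else x for i, x in enumerate(logits)]


# Made-up router logits over four experts for one token position
# in an English sentence and in its parallel non-English sentence.
english = [2.0, 1.5, 0.1, -0.3]
other = [0.2, 1.9, 1.4, -0.5]

before = routing_alignment(english, other, k=2)   # top-2 sets {0, 1} vs {1, 2}
steered = steer(other, promoted={0}, bonus=2.0)   # promote expert 0, active in English
after = routing_alignment(english, steered, k=2)  # top-2 sets {0, 1} vs {0, 1}
```

In this toy setting the overlap rises from 1/3 to 1.0 once expert 0 is promoted. In a real model, an intervention of this kind would presumably be applied only at selected middle layers, and the set of promoted experts would be estimated from English task data rather than fixed by hand.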
