Layerwise Recurrent Router for Mixture-of-Experts

Abstract

The scaling of large language models (LLMs) has revolutionized their capabilities in various tasks, yet this growth must be matched with efficient computational strategies. The Mixture-of-Experts (MoE) architecture stands out for its ability to scale model size without significantly increasing training costs. Despite this advantage, current MoE models are often parameter-inefficient. For instance, a pre-trained MoE-based LLM with 52 billion parameters might perform comparably to a standard model with 6.7 billion parameters. The router is a crucial part of MoE, yet current routers in different layers assign tokens independently, without leveraging historical routing information. This can lead to suboptimal token-expert combinations and contribute to the parameter inefficiency problem. To alleviate this issue, we introduce the Layerwise Recurrent Router for Mixture-of-Experts (RMoE). RMoE leverages a Gated Recurrent Unit (GRU) to establish dependencies between routing decisions across consecutive layers. Such layerwise recurrence can be computed efficiently in parallel over input tokens and introduces negligible cost. Our extensive empirical evaluations demonstrate that RMoE-based language models consistently outperform a spectrum of baseline models. Furthermore, RMoE integrates a novel computation stage orthogonal to existing methods, allowing seamless compatibility with other MoE architectures. Our analyses attribute RMoE's gains to its effective cross-layer information sharing, which also improves expert selection and diversity. Our code is at https://github.com/qiuzh20/RMoE
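To make the layerwise recurrence concrete, the following is a minimal sketch and not the released implementation. It assumes a single GRU cell shared across layers, one gating matrix per layer, and top-k selection over softmax probabilities; the names `GRUCell`, `route_token`, `gates`, and `top_k` are hypothetical, the weights are random stand-ins for trained parameters, and the expert computation itself is omitted.

```python
import math
import random


def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def rand_matrix(rows, cols, rng):
    """Small random weights standing in for trained parameters."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def softmax(v):
    m = max(v)
    e = [math.exp(a - m) for a in v]
    s = sum(e)
    return [a / s for a in e]


class GRUCell:
    """Standard GRU cell; biases are omitted for brevity."""

    def __init__(self, input_dim, hidden_dim, rng):
        self.hidden_dim = hidden_dim
        self.Wz = rand_matrix(hidden_dim, input_dim, rng)
        self.Uz = rand_matrix(hidden_dim, hidden_dim, rng)
        self.Wr = rand_matrix(hidden_dim, input_dim, rng)
        self.Ur = rand_matrix(hidden_dim, hidden_dim, rng)
        self.Wh = rand_matrix(hidden_dim, input_dim, rng)
        self.Uh = rand_matrix(hidden_dim, hidden_dim, rng)

    def step(self, x, h):
        # Update gate z and reset gate r.
        z = [sigmoid(a + b) for a, b in zip(matvec(self.Wz, x), matvec(self.Uz, h))]
        r = [sigmoid(a + b) for a, b in zip(matvec(self.Wr, x), matvec(self.Ur, h))]
        # Candidate state computed from the input and the reset-scaled previous state.
        rh = [ri * hi for ri, hi in zip(r, h)]
        cand = [math.tanh(a + b) for a, b in zip(matvec(self.Wh, x), matvec(self.Uh, rh))]
        # Interpolate between the previous state and the candidate.
        return [(1 - zi) * hi + zi * ci for zi, hi, ci in zip(z, h, cand)]


def route_token(layer_inputs, gru, gates, top_k):
    """Route one token through all layers, carrying a router state across layers.

    layer_inputs[i] is the token representation seen by the router at layer i
    (assumed given here, since the experts are not modeled).
    gates[i] is the (num_experts x hidden_dim) gating matrix of layer i.
    """
    h = [0.0] * gru.hidden_dim  # router state before the first layer
    decisions = []
    for x, gate in zip(layer_inputs, gates):
        h = gru.step(x, h)  # the state depends on all earlier layers
        probs = softmax(matvec(gate, h))
        chosen = sorted(range(len(probs)), key=lambda e: probs[e], reverse=True)[:top_k]
        decisions.append((chosen, probs))
    return decisions


if __name__ == "__main__":
    rng = random.Random(0)
    num_layers, token_dim, hidden_dim, num_experts = 4, 8, 6, 4
    gru = GRUCell(token_dim, hidden_dim, rng)
    gates = [rand_matrix(num_experts, hidden_dim, rng) for _ in range(num_layers)]
    inputs = [[rng.uniform(-1, 1) for _ in range(token_dim)] for _ in range(num_layers)]
    for layer, (chosen, _) in enumerate(route_token(inputs, gru, gates, top_k=2)):
        print("layer", layer, "experts", chosen)
```

In this sketch the recurrence runs over layers for a single token, so different tokens never interact inside the router. That is why such a router can be batched over all input tokens at once.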
