A Closer Look into Mixture-of-Experts in Large Language Models

Abstract

Mixture-of-experts (MoE) is gaining increasing attention due to its unique properties and remarkable performance, especially on language tasks. By sparsely activating only a subset of parameters for each token, an MoE architecture can increase model size without sacrificing computational efficiency, achieving a better trade-off between performance and training cost. However, the underlying mechanism of MoE remains underexplored, and its degree of modularization is still in question. In this paper, we make an initial attempt to understand the inner workings of MoE-based large language models. Concretely, we comprehensively study the parametric and behavioral features of three recent MoE-based models and report several intriguing observations: (1) neurons act like fine-grained experts; (2) the MoE router usually selects experts with larger output norms; (3) expert diversity increases with layer depth, with the last layer being an outlier. Based on these observations, we also offer suggestions for a broad range of MoE practitioners, on topics such as router design and expert allocation. We hope this work can shed light on future research on the MoE framework and other modular architectures. Code is available at https://github.com/kamanphoebe/Look-into-MoEs.
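To make the sparse activation and routing mentioned in the abstract concrete, here is a minimal sketch of top-k routing for a single token. It is not taken from the paper's repository: the function names (`top_k_route`, `matvec`, `random_matrix`), the toy dimensions, the ReLU experts, and the random weights are all illustrative assumptions. The sketch also evaluates every expert so that output norms can be set beside the router's choice, which is the kind of comparison behind observation (2); with random weights no particular relationship between the two should be expected.

```python
import math
import random


def matvec(matrix, vector):
    """Multiply a list-of-rows matrix by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def top_k_route(x, router_w, experts, k=2):
    """Route one token through the top-k experts of a toy MoE layer.

    x: token hidden state of length d.
    router_w: n_experts rows of length d (hypothetical gating weights).
    experts: list of (w_in, w_out) pairs, w_in is h x d and w_out is d x h.
    """
    logits = matvec(router_w, x)  # one routing score per expert
    # Indices of the k highest-scoring experts, best first.
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Softmax over the selected experts only (one common convention).
    peak = logits[top[0]]
    exps = [math.exp(logits[i] - peak) for i in top]
    total = sum(exps)
    gates = [e / total for e in exps]
    # Every expert is evaluated here purely for analysis; a real sparse
    # layer would compute only the selected ones.
    outputs = []
    for w_in, w_out in experts:
        hidden = [max(v, 0.0) for v in matvec(w_in, x)]  # ReLU, an assumption
        outputs.append(matvec(w_out, hidden))
    norms = [math.sqrt(sum(v * v for v in out)) for out in outputs]
    # Layer output: gate-weighted sum of the selected experts' outputs.
    y = [sum(g * outputs[i][j] for g, i in zip(gates, top)) for j in range(len(x))]
    return y, top, gates, norms


def random_matrix(rng, rows, cols):
    """Gaussian random matrix standing in for trained weights."""
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
d, h, n_experts = 8, 16, 4
router_w = random_matrix(rng, n_experts, d)
experts = [(random_matrix(rng, h, d), random_matrix(rng, d, h)) for _ in range(n_experts)]
x = [rng.gauss(0.0, 1.0) for _ in range(d)]

y, top, gates, norms = top_k_route(x, router_w, experts, k=2)
print("selected experts:", top)
print("gate weights:", gates)
print("output norm of each expert:", norms)
```

Only `k` of the `n_experts` experts contribute to `y`, which is why the per-token compute stays roughly constant as more experts are added. Repeating the norm comparison over many tokens with trained weights would be needed before drawing any conclusion about router behavior.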
