UniPool: A Globally Shared Expert Pool for Mixture-of-Experts

Abstract

Modern Mixture-of-Experts (MoE) architectures allocate expert capacity through a rigid per-layer rule: each transformer layer owns a separate set of experts. This convention couples depth scaling to linear growth in expert parameters and assumes that every layer needs isolated expert capacity. Recent analyses and our own routing probe challenge this rule: replacing a deeper layer's learned top-k router with uniform random routing lowers downstream accuracy by only 1.0-1.6 points across multiple production MoE models. Motivated by this redundancy, we propose UniPool, an MoE architecture that treats expert capacity as a global architectural budget by replacing per-layer expert ownership with a single shared pool accessed by independent per-layer routers. To make training under sharing stable and balanced, we introduce a pool-level auxiliary loss that balances expert utilization across the entire pool, and we adopt NormRouter to provide sparse, scale-stable routing into the shared pool. Across five LLaMA-architecture model scales (182M, 469M, 650M, 830M, and 978M parameters), each trained on 30B tokens from the Pile, UniPool consistently improves validation loss and perplexity over matched vanilla MoE baselines, reducing validation loss by up to 0.0386. Beyond the raw loss improvement, our results identify pool size as an explicit depth-scaling hyperparameter: reduced-pool UniPool variants that use only 41.6%-66.7% of the vanilla expert-parameter budget match or outperform layer-wise MoE at the tested scales. Under a shared-pool design, therefore, expert parameters need not grow linearly with depth; they can grow sublinearly while remaining more efficient and effective than vanilla MoE. Further analysis shows that UniPool's benefits compose with finer-grained expert decomposition.
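To make the shared-pool idea concrete, the following is a minimal toy sketch, not the authors' implementation. It assumes a plain softmax top-k router as a stand-in for NormRouter (whose form is not given here) and a Switch-style balance term computed over pool-wide statistics as a stand-in for the pool-level auxiliary loss. All sizes, the toy tanh experts, and names such as `pool`, `routers`, and `EXPERTS_PER_LAYER` are hypothetical choices for illustration.

```python
import math
import random

random.seed(0)

NUM_LAYERS = 4    # hypothetical depth
POOL_SIZE = 6     # hypothetical number of experts shared by all layers
TOP_K = 2         # experts activated per token in each layer
DIM = 8           # toy hidden size
NUM_TOKENS = 32

def rand_vec(n):
    return [random.gauss(0.0, 1.0) for _ in range(n)]

# One expert pool for the whole model (toy experts: elementwise tanh units).
pool = [rand_vec(DIM) for _ in range(POOL_SIZE)]
# One independent router per layer; each scores every expert in the pool.
routers = [[rand_vec(DIM) for _ in range(POOL_SIZE)] for _ in range(NUM_LAYERS)]

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def expert_forward(weights, x):
    return [math.tanh(w * xi) for w, xi in zip(weights, x)]

def layer_forward(router, x, counts, prob_sums):
    # Assumption: plain softmax top-k routing stands in for NormRouter.
    scores = [sum(w * xi for w, xi in zip(row, x)) / math.sqrt(DIM) for row in router]
    probs = softmax(scores)
    top = sorted(range(POOL_SIZE), key=lambda i: probs[i], reverse=True)[:TOP_K]
    gate_total = sum(probs[i] for i in top)
    out = list(x)  # residual connection
    for i in top:
        counts[i] += 1  # accumulated across ALL layers, not per layer
        y = expert_forward(pool[i], x)
        g = probs[i] / gate_total  # gates renormalized over the selected experts
        out = [o + g * yi for o, yi in zip(out, y)]
    for i, p in enumerate(probs):
        prob_sums[i] += p
    return out

def run(tokens):
    counts = [0] * POOL_SIZE
    prob_sums = [0.0] * POOL_SIZE
    for x in tokens:
        for router in routers:
            x = layer_forward(router, x, counts, prob_sums)
    decisions = len(tokens) * NUM_LAYERS
    frac = [c / (decisions * TOP_K) for c in counts]   # share of assignments per expert
    mean_prob = [p / decisions for p in prob_sums]     # mean router probability per expert
    # Assumption: Switch-style balance term, evaluated over the whole pool.
    aux = POOL_SIZE * sum(f * p for f, p in zip(frac, mean_prob))
    return counts, aux

tokens = [rand_vec(DIM) for _ in range(NUM_TOKENS)]
counts, aux_loss = run(tokens)

# Expert-parameter budget relative to a per-layer design that would give
# each layer EXPERTS_PER_LAYER experts of its own (hypothetical value).
EXPERTS_PER_LAYER = 3
budget_ratio = POOL_SIZE / (NUM_LAYERS * EXPERTS_PER_LAYER)
print(counts, round(aux_loss, 4), budget_ratio)
```

In this sketch the pool is created once and every layer's router indexes into it, so adding layers adds routers but no expert parameters; `budget_ratio` is the corresponding fraction of a per-layer expert budget. The utilization counts are pooled over all layers before the balance term is computed, which is what distinguishes a pool-level term from a per-layer one.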

