EMO: Pretraining Mixture of Experts for Emergent Modularity

Abstract

Large language models are typically deployed as monolithic systems, requiring the full model even when an application needs only a narrow subset of capabilities, e.g., code, math, or domain-specific knowledge. Mixture-of-Experts (MoE) models seemingly offer an alternative by activating only a subset of experts per input, but in practice, restricting inference to a subset of experts for a given domain leads to severe performance degradation. This limits their practicality in memory-constrained settings, especially as models grow larger and sparser. We introduce EMO, an MoE designed for modularity, i.e., the independent use and composition of expert subsets, without requiring human-defined priors. Our key idea is to encourage tokens from similar domains to rely on similar experts. Since tokens within a document often share a domain, EMO restricts them to select experts from a shared pool, while allowing different documents to use different pools. This simple constraint enables coherent expert groupings to emerge during pretraining using document boundaries alone. We pretrain a 1B-active, 14B-total parameter EMO on 1T tokens. As a full model, it matches standard MoE performance. Crucially, it enables selective expert use: retaining only 25% of experts incurs just a 1% absolute drop in performance, and retaining only 12.5% incurs a 3% absolute drop, whereas standard MoEs break under the same setting. We further find that expert subsets in EMO specialize at the semantic level (e.g., domains such as math or code), in contrast to the low-level syntactic specialization observed in standard MoEs. Altogether, our results demonstrate a path toward modular, memory-efficient deployment of large, sparse models and open new opportunities for composable architectures.
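To make the document-level constraint concrete, the following is a minimal NumPy sketch of routing in which all tokens of one document choose experts from a shared pool. The abstract does not say how a document's pool is chosen, so the rule used here (the experts with the highest mean router probability over the document's tokens) is an assumption made purely for illustration, as are the function name `document_pool_routing` and the parameters `pool_size` and `top_k`. It should not be read as the authors' implementation.

```python
import numpy as np


def document_pool_routing(router_logits, pool_size, top_k):
    """Route every token of ONE document through a shared expert pool.

    router_logits: array of shape (num_tokens, num_experts).
    Returns (pool, token_experts) with shapes (pool_size,) and
    (num_tokens, top_k).
    """
    num_tokens, num_experts = router_logits.shape
    assert top_k <= pool_size <= num_experts

    # Per-token softmax over experts (shifted for numerical stability).
    shifted = router_logits - router_logits.max(axis=1, keepdims=True)
    probs = np.exp(shifted)
    probs /= probs.sum(axis=1, keepdims=True)

    # ASSUMPTION (not from the source): the document's pool is the
    # pool_size experts with the highest mean router probability.
    doc_scores = probs.mean(axis=0)
    pool = np.sort(np.argsort(-doc_scores)[:pool_size])

    # Hide experts outside the pool, then take each token's top-k
    # experts among the ones that remain.
    masked = np.full_like(probs, -np.inf)
    masked[:, pool] = probs[:, pool]
    token_experts = np.argsort(-masked, axis=1)[:, :top_k]
    return pool, token_experts


# Usage: one document with 6 tokens and 16 experts (random logits).
rng = np.random.default_rng(0)
logits = rng.normal(size=(6, 16))
pool, token_experts = document_pool_routing(logits, pool_size=4, top_k=2)
assert np.isin(token_experts, pool).all()  # every choice lies in the pool
```

Different documents would call the function separately and may therefore end up with different pools, which is the property the abstract relies on for expert groupings to emerge from document boundaries alone.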
