PreMoe: Lightening MoEs on Constrained Memory by Expert Pruning and Retrieval

Abstract

Mixture-of-experts (MoE) architectures enable scaling large language models (LLMs) to vast parameter counts without a proportional rise in computational cost. However, the significant memory demands of large MoE models hinder their deployment across computational environments, from cloud servers to consumer devices. This study first demonstrates pronounced task-specific specialization in expert activation patterns within MoE layers. Building on this, we introduce PreMoe, a novel framework that enables efficient deployment of massive MoE models in memory-constrained environments. PreMoe features two main components: probabilistic expert pruning (PEP) and task-adaptive expert retrieval (TAER). PEP employs a new metric, the task-conditioned expected selection score (TCESS), derived from router logits, to quantify expert importance for specific tasks and thereby identify a minimal set of critical experts. TAER leverages these task-specific expert importance profiles for efficient inference. It pre-computes and stores compact expert patterns for diverse tasks. When a user query is received, TAER rapidly identifies the most relevant stored task pattern and reconstructs the model by loading only the small subset of experts crucial for that task. This approach dramatically reduces the memory footprint across all deployment scenarios. DeepSeek-R1 671B maintains 97.2% accuracy on MATH500 when pruned to an 8/128 configuration (50% expert reduction), and still achieves 72.0% with aggressive 8/32 pruning (87.5% expert reduction). Pangu-Ultra-MoE 718B achieves 97.15% on MATH500 and 81.3% on AIME24 with 8/128 pruning, while even more aggressive pruning to 4/64 (390GB memory) preserves 96.95% accuracy on MATH500. We make our code publicly available at https://github.com/JarvisPei/PreMoe.
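The two-stage flow described in the abstract (score experts per task from router logits, store the resulting patterns, then match a query to the nearest pattern and keep only that task's experts) can be illustrated with a toy sketch. This is not the paper's implementation: the abstract does not give the TCESS formula or the retrieval rule, so the scoring function below (mean router probability over tokens, counted only when the expert is in the token's top-k) is a simplified stand-in, and the L2-distance matching, the function names, and the toy logits are all assumptions made for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one token's router logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def expected_selection_scores(token_logits, top_k):
    """Per-expert score: router probability averaged over tokens, counted
    only when the expert is among that token's top_k choices.
    Assumption: a simplified stand-in for TCESS, not the paper's definition."""
    num_experts = len(token_logits[0])
    scores = [0.0] * num_experts
    for logits in token_logits:
        probs = softmax(logits)
        ranked = sorted(range(num_experts), key=lambda e: probs[e], reverse=True)
        for e in ranked[:top_k]:
            scores[e] += probs[e]
    return [s / len(token_logits) for s in scores]


def prune_experts(scores, keep):
    """Indices of the `keep` highest-scoring experts, in ascending order."""
    ranked = sorted(range(len(scores)), key=lambda e: scores[e], reverse=True)
    return sorted(ranked[:keep])


def retrieve_pattern(query_scores, stored_patterns):
    """Name of the stored task whose score profile is nearest to the query's.
    Assumption: plain L2 distance; the abstract does not specify the rule."""
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    return min(stored_patterns, key=lambda name: dist(query_scores, stored_patterns[name]))


# Toy data (hypothetical): 4 experts, top-2 routing, two calibration tasks.
math_logits = [[4.0, 3.0, 0.0, -1.0], [3.5, 4.0, -0.5, 0.0]]
code_logits = [[-1.0, 0.0, 4.0, 3.0], [0.0, -0.5, 3.0, 4.5]]

# Offline: pre-compute and store one compact pattern per task.
patterns = {
    "math": expected_selection_scores(math_logits, top_k=2),
    "code": expected_selection_scores(code_logits, top_k=2),
}

# Online: score the query, pick the nearest pattern, keep only its experts.
query_scores = expected_selection_scores([[3.0, 3.5, 0.5, 0.0]], top_k=2)
task = retrieve_pattern(query_scores, patterns)
kept = prune_experts(patterns[task], keep=2)
print(task, kept)  # math [0, 1]
```

In this toy setting half of the experts are dropped, mirroring the 50% reduction quoted for the 8/128 configuration; a real deployment would load only the weights of the experts in `kept` for each MoE layer.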
