Not All Models Suit Expert Offloading: On Local Routing Consistency of Mixture-of-Expert Models

Abstract

Mixture-of-Experts (MoE) enables efficient scaling of large language models (LLMs) with sparsely activated experts during inference. To effectively deploy large MoE models on memory-constrained devices, many systems introduce *expert offloading*, which caches a subset of experts in fast memory and leaves the others in slow memory to run on CPU or load on demand. While some research has exploited the locality of expert activations, where consecutive tokens activate similar experts, the degree of this **local routing consistency** varies across models and remains understudied. In this paper, we propose two metrics to measure the local routing consistency of MoE models: (1) **Segment Routing Best Performance (SRP)**, which evaluates how well a fixed group of experts can cover the needs of a segment of tokens, and (2) **Segment Cache Best Hit Rate (SCH)**, which measures the optimal segment-level cache hit rate under a given cache size limit. We analyzed 20 MoE LLMs with diverse sizes and architectures and found that models that apply MoE on every layer and do not use shared experts exhibit the highest local routing consistency. We further showed that domain-specialized experts contribute more to routing consistency than vocabulary-specialized ones, and that most models can balance cache effectiveness and efficiency with a cache size of approximately 2x the number of active experts. These findings pave the way for memory-efficient MoE design and deployment without compromising inference speed. We publish the code for replicating the experiments at https://github.com/ljcleo/moe-lrc .
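
As a rough illustration of the cache-oriented metric, the sketch below computes the hit rate of the best fixed cache for a single segment under a simplified reading of SCH: the cache holds the experts activated most often within the segment and does not change while the segment is processed. The abstract does not give the formal definition, so the function name `best_segment_hit_rate`, the toy routing trace, and this per-segment formulation are assumptions made for illustration, not the paper's exact metric.

```python
from collections import Counter


def best_segment_hit_rate(segment_routes, cache_size):
    """Hit rate of the best fixed expert cache for one segment.

    segment_routes: one tuple of activated expert ids per token (hypothetical trace).
    cache_size: number of experts the cache can hold.
    Assumption: the cache is fixed for the whole segment, so the best choice
    is simply the `cache_size` most frequently activated experts.
    """
    counts = Counter(e for token_experts in segment_routes for e in token_experts)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    hits = sum(c for _, c in counts.most_common(cache_size))
    return hits / total


# Hypothetical segment: 4 tokens, top-2 routing.
routes = [(0, 1), (0, 2), (0, 1), (1, 3)]
small_cache = best_segment_hit_rate(routes, cache_size=2)  # cache = 1x active experts
large_cache = best_segment_hit_rate(routes, cache_size=4)  # cache = 2x active experts
```

In this toy trace, a cache of 2 experts serves 6 of the 8 activations, while a cache of 4 experts (2x the active experts) serves all of them.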
