FineRMoE: Dimension Expansion for Finer-Grained Expert with Its Upcycling Approach

Abstract

As revealed by the scaling law of fine-grained MoE, model performance stops improving once the granularity of the intermediate dimension exceeds the optimal threshold, which limits further gains from single-dimension fine-grained design. To address this bottleneck, we propose FineRMoE (FineR-Grained MoE), an architecture that extends fine-grained expert design to both the intermediate and output dimensions, aiming to enhance expert specialization beyond the single-dimension limit. We further introduce a bi-level sparse forward computation paradigm and a specialized routing mechanism to govern expert activation. In addition, to avoid the prohibitive cost of training FineRMoE from scratch, we devise a generalized upcycling method that builds FineRMoE in a cost-effective manner. Extensive experiments across ten standard benchmarks demonstrate the superior performance of FineRMoE. Compared with the strongest baseline, FineRMoE achieves 6 times higher parameter efficiency, 281 times lower prefill latency, and 136 times higher decoding throughput during inference.
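To make the idea of partitioning along two dimensions concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes a plain two-layer ReLU feed-forward block and splits its weights into a grid of small experts: rows of the grid slice the intermediate dimension, columns slice the output dimension. The names `dense_ffn`, `split_ffn`, and `grid_forward`, the boolean `active` mask, and the choice of ReLU are all illustrative assumptions; the abstract does not specify the routing mechanism or the upcycling procedure, so neither is modeled here.

```python
import numpy as np

def dense_ffn(x, w_in, w_out):
    """Two-layer ReLU feed-forward block: d_model -> d_ff -> d_model."""
    return np.maximum(x @ w_in, 0.0) @ w_out

def split_ffn(w_in, w_out, n_inter, n_out):
    """Partition a dense FFN into an n_inter x n_out grid of small experts.

    Expert (i, j) uses the i-th slice of the intermediate dimension and
    produces the j-th slice of the output dimension. (Hypothetical layout.)
    """
    in_slices = np.split(w_in, n_inter, axis=1)   # columns of w_in
    out_rows = np.split(w_out, n_inter, axis=0)   # matching rows of w_out
    return [
        [(in_slices[i], block) for block in np.split(out_rows[i], n_out, axis=1)]
        for i in range(n_inter)
    ]

def grid_forward(x, experts, active=None):
    """Sum over intermediate slices, concatenate over output slices.

    `active` is an optional boolean mask of shape (n_inter, n_out);
    experts that are masked out contribute zeros (a stand-in for sparsity).
    """
    n_inter, n_out = len(experts), len(experts[0])
    if active is None:
        active = np.ones((n_inter, n_out), dtype=bool)
    columns = []
    for j in range(n_out):
        width = experts[0][j][1].shape[1]
        acc = np.zeros(x.shape[:-1] + (width,))
        for i in range(n_inter):
            if active[i, j]:
                a, b = experts[i][j]
                acc += np.maximum(x @ a, 0.0) @ b
        columns.append(acc)
    return np.concatenate(columns, axis=-1)

rng = np.random.default_rng(0)
x = rng.normal(size=(3, 8))        # 3 tokens, d_model = 8
w_in = rng.normal(size=(8, 16))    # d_ff = 16
w_out = rng.normal(size=(16, 8))
experts = split_ffn(w_in, w_out, n_inter=4, n_out=2)

# With every expert active, the grid reproduces the dense block.
full_match = np.allclose(grid_forward(x, experts), dense_ffn(x, w_in, w_out))

# Activating a single expert fills only its own output slice.
mask = np.zeros((4, 2), dtype=bool)
mask[0, 0] = True
sparse_out = grid_forward(x, experts, mask)
```

Because ReLU acts elementwise, activating all experts recovers the dense output exactly, which is the property that makes initializing experts from a dense checkpoint plausible in this toy setting. Masking experts then trades accuracy for compute along both axes independently.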
