FineRMoE: Dimension Expansion for Finer-Grained Expert with Its Upcycling Approach

March 9, 2026
Authors: Ning Liao, Xiaoxing Wang, Xiaohan Qin, Junchi Yan
cs.AI

Abstract

As revealed by the scaling law of fine-grained MoE, model performance ceases to improve once the granularity of the intermediate dimension exceeds the optimal threshold, limiting further gains from single-dimension fine-grained design. To address this bottleneck, we propose FineRMoE (FineR-Grained MoE), an architecture that extends fine-grained expert design to both the intermediate and output dimensions, aiming to enhance expert specialization beyond the single-dimension limit. We further introduce a bi-level sparse forward computation paradigm and a specialized routing mechanism to govern expert activation. In addition, to obviate the prohibitive cost of training FineRMoE from scratch, we devise a generalized upcycling method that builds FineRMoE in a cost-effective manner. Extensive experiments demonstrate that FineRMoE achieves superior performance across ten standard benchmarks. Compared with the strongest baseline, FineRMoE achieves 6 times higher parameter efficiency, 281 times lower prefill latency, and 136 times higher decoding throughput during inference.
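The abstract only names the mechanisms, so as a rough illustration, here is a minimal PyTorch sketch of what an expert that is fine-grained along both the intermediate and the output dimension, governed by a two-level router, could look like, together with a toy upcycling initializer that carves a dense FFN into the slices. Every name, the product-of-gates routing, and the slicing scheme are illustrative assumptions, not the paper's actual design.

```python
# Hypothetical sketch of dual-dimension fine-grained MoE (not the paper's code).
import torch
import torch.nn as nn
import torch.nn.functional as F

class DualGrainMoE(nn.Module):
    """Toy MoE layer whose experts are sliced along BOTH the intermediate
    and the output dimension, with a two-level (bi-level) router."""

    def __init__(self, d_model=64, d_inter=256, n_inter_slices=8,
                 n_out_slices=4, top_k_inter=2, top_k_out=2):
        super().__init__()
        assert d_inter % n_inter_slices == 0 and d_model % n_out_slices == 0
        self.s_inter = d_inter // n_inter_slices   # width of one intermediate slice
        self.s_out = d_model // n_out_slices       # width of one output slice
        self.n_out_slices = n_out_slices
        # Up-projection slices: each maps d_model -> s_inter.
        self.up = nn.ModuleList([nn.Linear(d_model, self.s_inter, bias=False)
                                 for _ in range(n_inter_slices)])
        # Down-projection slices: one (s_inter -> s_out) block per
        # (intermediate slice, output slice) pair.
        self.down = nn.Parameter(
            torch.randn(n_inter_slices, n_out_slices, self.s_inter, self.s_out) * 0.02)
        # Level-1 router over intermediate slices; level-2 over output slices.
        self.router_inter = nn.Linear(d_model, n_inter_slices, bias=False)
        self.router_out = nn.Linear(d_model, n_out_slices, bias=False)
        self.k_inter, self.k_out = top_k_inter, top_k_out

    def forward(self, x):                          # x: (batch, d_model)
        B, D = x.shape
        y = x.new_zeros(B, D)
        # Level 1: pick top-k intermediate slices per token.
        w1, idx1 = F.softmax(self.router_inter(x), dim=-1).topk(self.k_inter, dim=-1)
        # Level 2: pick top-k output slices per token.
        w2, idx2 = F.softmax(self.router_out(x), dim=-1).topk(self.k_out, dim=-1)
        for b in range(B):                         # loops kept for clarity, not speed
            for i, wi in zip(idx1[b].tolist(), w1[b].tolist()):
                h = F.gelu(self.up[i](x[b]))       # activated intermediate slice
                for j, wj in zip(idx2[b].tolist(), w2[b].tolist()):
                    lo = j * self.s_out
                    y[b, lo:lo + self.s_out] += wi * wj * (h @ self.down[i, j])
        return y

def upcycle_from_dense(layer, dense_up, dense_down):
    """Hypothetical upcycling: initialize every slice from a dense FFN
    (d_model -> d_inter -> d_model) so training need not start from scratch."""
    with torch.no_grad():
        for i, up in enumerate(layer.up):
            r = slice(i * layer.s_inter, (i + 1) * layer.s_inter)
            up.weight.copy_(dense_up.weight[r])    # rows of W_up = output units
            for j in range(layer.n_out_slices):
                c = slice(j * layer.s_out, (j + 1) * layer.s_out)
                layer.down[i, j].copy_(dense_down.weight[c, r].T)

# Smoke test: warm-start from a random dense FFN, then run one forward pass.
layer = DualGrainMoE()
upcycle_from_dense(layer, nn.Linear(64, 256, bias=False), nn.Linear(256, 64, bias=False))
print(layer(torch.randn(3, 64)).shape)  # torch.Size([3, 64])
```

The sketch multiplies the two gate weights to score each (intermediate slice, output slice) pair; a real implementation would batch the slice matmuls and add load-balancing objectives, which are omitted here for readability.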