Not All Models Suit Expert Offloading: On Local Routing Consistency of Mixture-of-Expert Models
May 21, 2025
Authors: Jingcong Liang, Siyuan Wang, Miren Tian, Yitong Li, Duyu Tang, Zhongyu Wei
cs.AI
Abstract
Mixture-of-Experts (MoE) enables efficient scaling of large language models
(LLMs) with sparsely activated experts during inference. To effectively deploy
large MoE models on memory-constrained devices, many systems introduce *expert
offloading* that caches a subset of experts in fast memory, leaving others on
slow memory to run on CPU or load on demand. While some research has exploited
the locality of expert activations, where consecutive tokens activate similar
experts, the degree of this **local routing consistency** varies across models
and remains understudied. In this paper, we propose two metrics to measure
local routing consistency of MoE models: (1) **Segment Routing Best Performance
(SRP)**, which evaluates how well a fixed group of experts can cover the needs
of a segment of tokens, and (2) **Segment Cache Best Hit Rate (SCH)**, which
measures the optimal segment-level cache hit rate under a given cache size
limit. We analyzed 20 MoE LLMs with diverse sizes and architectures and found
that models that apply MoE on every layer and do not use shared experts exhibit
the highest local routing consistency. We further showed that
domain-specialized experts contribute more to routing consistency than
vocabulary-specialized ones, and that most models can balance cache
effectiveness and efficiency with cache sizes approximately 2x the number of
active experts. These findings pave the way for memory-efficient MoE design and
deployment without compromising inference speed. We publish the code for
replicating experiments at https://github.com/ljcleo/moe-lrc.
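
To make the two metrics concrete, below is a minimal sketch of how segment-level routing coverage could be scored from a single layer's top-k routing trace. It is an illustrative interpretation, not the paper's implementation: the exact SRP/SCH definitions, multi-layer aggregation, and segmentation scheme are given in the paper and the moe-lrc repository, and the function and variable names used here (`segment_coverage`, `srp`, `sch`, `trace`) are hypothetical.

```python
import numpy as np


def segment_coverage(routing: np.ndarray, budget: int) -> float:
    """Fraction of expert activations in one token segment covered by the
    `budget` most frequently activated experts of that segment.

    routing: (num_tokens, top_k) array of activated expert IDs for one layer.
    """
    _, counts = np.unique(routing, return_counts=True)
    top = np.sort(counts)[::-1][:budget]          # most-used experts first
    return float(top.sum()) / routing.size


def srp(segment_routing: np.ndarray, num_active: int) -> float:
    """Illustrative SRP-style score: how well a fixed group of experts,
    equal in size to the per-token number of active experts, can cover
    the routing demands of a single segment."""
    return segment_coverage(segment_routing, num_active)


def sch(trace: np.ndarray, segment_len: int, cache_size: int) -> float:
    """Illustrative SCH-style score: average over segments of the best
    achievable segment-level cache hit rate under a `cache_size` budget."""
    hit_rates = [
        segment_coverage(trace[start : start + segment_len], cache_size)
        for start in range(0, len(trace), segment_len)
    ]
    return float(np.mean(hit_rates))


# Hypothetical trace: 1024 tokens routed to top-2 of 64 experts in one layer.
# Random routing has low consistency by construction; a real model's trace
# would be loaded from logged router outputs instead.
rng = np.random.default_rng(0)
trace = rng.integers(0, 64, size=(1024, 2))
print(srp(trace[:128], num_active=2))             # coverage of a 2-expert group
print(sch(trace, segment_len=128, cache_size=4))  # cache budget = 2x active experts
```

Keeping the experts most frequently activated within a segment is the best any fixed per-segment selection can do, which is why the same coverage routine serves as an upper bound both for the fixed-group (SRP-style) view and for the cache-budget (SCH-style) view, including the "cache size about 2x the active experts" operating point discussed in the abstract.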