

Not All Models Suit Expert Offloading: On Local Routing Consistency of Mixture-of-Expert Models

May 21, 2025
作者: Jingcong Liang, Siyuan Wang, Miren Tian, Yitong Li, Duyu Tang, Zhongyu Wei
cs.AI

Abstract

Mixture-of-Experts (MoE) enables efficient scaling of large language models (LLMs) with sparsely activated experts during inference. To effectively deploy large MoE models on memory-constrained devices, many systems introduce *expert offloading* that caches a subset of experts in fast memory, leaving others on slow memory to run on CPU or load on demand. While some research has exploited the locality of expert activations, where consecutive tokens activate similar experts, the degree of this **local routing consistency** varies across models and remains understudied. In this paper, we propose two metrics to measure local routing consistency of MoE models: (1) **Segment Routing Best Performance (SRP)**, which evaluates how well a fixed group of experts can cover the needs of a segment of tokens, and (2) **Segment Cache Best Hit Rate (SCH)**, which measures the optimal segment-level cache hit rate under a given cache size limit. We analyzed 20 MoE LLMs with diverse sizes and architectures and found that models that apply MoE on every layer and do not use shared experts exhibit the highest local routing consistency. We further showed that domain-specialized experts contribute more to routing consistency than vocabulary-specialized ones, and that most models can balance between cache effectiveness and efficiency with cache sizes approximately 2x the active experts. These findings pave the way for memory-efficient MoE design and deployment without compromising inference speed. We publish the code for replicating experiments at https://github.com/ljcleo/moe-lrc .
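To make the two metrics more concrete, below is a minimal, illustrative Python sketch of segment-level oracle expert caching in the spirit of SRP/SCH. The routing trace, function names, and the use of activation frequency to pick the cached experts are assumptions for illustration only, not the paper's exact formulation; see the released code at https://github.com/ljcleo/moe-lrc for the actual definitions.

```python
# Toy sketch (not the paper's exact metrics): score how well a fixed expert
# set, chosen per segment by an oracle, covers that segment's routed experts.
# Assumes top-k routing decisions for a single MoE layer, given as a list of
# expert-id sets, one set per token.
from collections import Counter
from typing import List, Set


def best_fixed_set_coverage(segment: List[Set[int]], cache_size: int) -> float:
    """Fraction of expert activations in `segment` covered by the single
    best set of `cache_size` experts (an oracle segment-level cache)."""
    counts = Counter(e for experts in segment for e in experts)
    total = sum(counts.values())
    covered = sum(c for _, c in counts.most_common(cache_size))
    return covered / total if total else 0.0


def segment_level_hit_rate(routing: List[Set[int]],
                           segment_len: int,
                           cache_size: int) -> float:
    """Average oracle hit rate over consecutive segments of `segment_len` tokens."""
    segments = [routing[i:i + segment_len]
                for i in range(0, len(routing), segment_len)]
    return sum(best_fixed_set_coverage(s, cache_size) for s in segments) / len(segments)


# Hypothetical routing trace: 8 tokens, top-2 experts each, expert ids 0-7.
trace = [{0, 3}, {0, 3}, {1, 3}, {0, 5}, {3, 5}, {0, 3}, {2, 6}, {2, 6}]
print(segment_level_hit_rate(trace, segment_len=4, cache_size=4))  # 0.9375 for this toy trace
```

In this toy setup, a cache of 4 experts (2x the 2 active experts per token) already covers most activations within each segment, which mirrors the paper's observation that cache sizes around 2x the active experts balance effectiveness and efficiency.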
