Not All Models Suit Expert Offloading: On Local Routing Consistency of Mixture-of-Expert Models

Abstract

Mixture-of-Experts (MoE) enables efficient scaling of large language models (LLMs) with sparsely activated experts during inference. To deploy large MoE models effectively on memory-constrained devices, many systems introduce *expert offloading*, which caches a subset of experts in fast memory and leaves the rest in slow memory, either to run on the CPU or to be loaded on demand. While some research has exploited the locality of expert activations, where consecutive tokens activate similar experts, the degree of this **local routing consistency** varies across models and remains understudied. In this paper, we propose two metrics to measure the local routing consistency of MoE models: (1) **Segment Routing Best Performance (SRP)**, which evaluates how well a fixed group of experts can cover the needs of a segment of tokens, and (2) **Segment Cache Best Hit Rate (SCH)**, which measures the optimal segment-level cache hit rate under a given cache size limit. We analyzed 20 MoE LLMs with diverse sizes and architectures and found that models that apply MoE on every layer and do not use shared experts exhibit the highest local routing consistency. We further showed that domain-specialized experts contribute more to routing consistency than vocabulary-specialized ones, and that most models can balance cache effectiveness and efficiency with cache sizes approximately 2x the active experts. These findings pave the way for memory-efficient MoE design and deployment without compromising inference speed. We publish the code for replicating the experiments at https://github.com/ljcleo/moe-lrc.
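To make the idea behind a segment-level cache hit rate more concrete, the following is a minimal sketch and not the paper's exact definition of SCH. It assumes a single MoE layer, that each token's routing decision is available as a set of expert indices, and that the cache holds a fixed group of experts for the whole segment. Under those assumptions, the best fixed cache is simply the most frequently activated experts in the segment. The function name `segment_cache_hit_rate` and the toy routing data are hypothetical.

```python
from collections import Counter
from typing import Sequence, Set


def segment_cache_hit_rate(segment: Sequence[Set[int]], cache_size: int) -> float:
    """Best hit rate of a fixed cache of `cache_size` experts over one segment.

    `segment` holds, for each token, the set of expert indices it activates.
    Assumption (illustrative only): the cache content is fixed for the whole
    segment, so caching the most frequently activated experts is optimal.
    """
    # Count how many times each expert is activated within the segment.
    counts = Counter(expert for token_experts in segment for expert in token_experts)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    # Activations served by the `cache_size` most frequently used experts.
    hits = sum(count for _, count in counts.most_common(cache_size))
    return hits / total


# Toy segment: 4 tokens, each activating 2 of 4 experts.
toy_segment = [{0, 1}, {1, 2}, {1, 3}, {0, 1}]
print(segment_cache_hit_rate(toy_segment, cache_size=2))  # 0.75
```

In this toy segment, expert 1 is activated by every token, so a cache of two experts (1 and 0) already serves 6 of the 8 activations. A model with higher local routing consistency would reach a high hit rate with a smaller cache relative to its number of active experts.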
