Temporally Extended Mixture-of-Experts Models

Abstract

Mixture-of-Experts models, now popular for scaling capacity at fixed inference speed, switch experts at nearly every token. Once a model outgrows available GPU memory, this churn can render optimizations like offloading and pre-fetching ineffective. We make the case that the options framework in reinforcement learning is a perfect match to tackle this problem, and argue for temporally extended mixture-of-experts layers. Building on the option-critic framework with deliberation costs, we add a controller to each layer that learns when to switch expert sets and which to load. By applying this to gpt-oss-20b with low-rank adapters and a self-distillation reward, our method reduces switch rates from over 50% to below 5% while retaining up to 90% of base-model accuracy on MATH, MMLU, and MMMLU. This shows that even existing pre-trained models can be converted to temporally extended MoEs with lightweight training, with the deliberation cost allowing model trainers to trade off switching rates against capability. We hope this opens a principled path, grounded in the options framework, for memory-efficient serving and continual learning in ever-growing MoE models.
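The two quantities the abstract trades off against each other, the switch rate and a per-switch deliberation cost, can be made concrete with a small sketch. This is only an illustration under stated assumptions, not the authors' implementation: the function names `switch_rate` and `penalized_return`, the definition of a switch as any change of the active expert set between consecutive tokens, and the flat per-switch penalty are all hypothetical choices made here for clarity.

```python
from typing import FrozenSet, Sequence


def count_switches(expert_sets: Sequence[FrozenSet[int]]) -> int:
    """Count token transitions where the active expert set changes."""
    return sum(1 for prev, cur in zip(expert_sets, expert_sets[1:]) if prev != cur)


def switch_rate(expert_sets: Sequence[FrozenSet[int]]) -> float:
    """Fraction of token transitions with a change of expert set (assumed definition)."""
    if len(expert_sets) < 2:
        return 0.0
    return count_switches(expert_sets) / (len(expert_sets) - 1)


def penalized_return(
    rewards: Sequence[float],
    expert_sets: Sequence[FrozenSet[int]],
    deliberation_cost: float,
) -> float:
    """Total reward minus a fixed cost per switch (assumed form of the penalty)."""
    return sum(rewards) - deliberation_cost * count_switches(expert_sets)


# Hypothetical routing traces for one layer over four tokens.
per_token = [frozenset({0, 1}), frozenset({2, 3}), frozenset({0, 1}), frozenset({4, 5})]
extended = [frozenset({0, 1})] * 3 + [frozenset({2, 3})]

print(switch_rate(per_token), switch_rate(extended))
print(penalized_return([1.0] * 4, per_token, 0.5))
print(penalized_return([1.0] * 4, extended, 0.5))
```

In this toy setting, raising `deliberation_cost` makes the trace that holds its expert set for longer score better, which is the kind of knob the abstract describes for trading switching frequency against capability.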
