Rewiring Experts on the Fly: Continuous Rerouting for Better Online Adaptation in Mixture-of-Expert models
October 16, 2025
Authors: Guinan Su, Yanwu Yang, Li Shen, Lu Yin, Shiwei Liu, Jonas Geiping
cs.AI
Abstract
Mixture-of-Experts (MoE) models achieve efficient scaling through sparse
expert activation, but often suffer from suboptimal routing decisions due to
distribution shifts in deployment. While existing test-time adaptation methods
could potentially address these issues, they primarily focus on dense models
and require access to external data, limiting their practical applicability to
MoE architectures. However, we find that, instead of relying on reference data,
we can optimize MoE expert selection on-the-fly based only on input context. As
such, we propose a data-free, online test-time framework that
continuously adapts MoE routing decisions during text generation without
external supervision or data. Our method cycles between two phases: during the
prefill stage, and later at regular intervals, we optimize the routing
decisions of the model using self-supervision based on the already generated
sequence. Then we generate text as normal, maintaining the modified router
until the next adaptation. We implement this through lightweight additive vectors
that only update router logits in selected layers, maintaining computational
efficiency while preventing over-adaptation. The experimental results show
consistent performance gains on challenging reasoning tasks while maintaining
robustness to context shifts. For example, our method achieves a 5.5%
improvement on HumanEval with OLMoE. Furthermore, owing to its plug-and-play
property, our method naturally complements existing test-time scaling
techniques, e.g., achieving 6% average gains when incorporated with
self-consistency on DeepSeek-V2-Lite.
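The core mechanism described above, adapting only a lightweight additive vector on the router logits while the model weights stay frozen, can be illustrated with a toy sketch. Everything here is hypothetical (dimensions, the single-layer router, and the objective): the paper's self-supervised loss is computed from the generated sequence, so entropy minimization of the routing distribution is used below purely as a stand-in surrogate.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions (hypothetical): hidden size d, number of experts, top-k.
d, n_experts, k = 8, 4, 2
W = rng.normal(size=(d, n_experts))   # frozen router weights
delta = np.zeros(n_experts)           # additive vector: the ONLY adapted parameter

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def route(x):
    # Adapted routing: base logits plus the additive vector, then top-k experts.
    logits = x @ W + delta
    probs = softmax(logits)
    topk = np.argsort(logits)[-k:]
    return topk, probs

def entropy(p):
    return -np.sum(p * np.log(p + 1e-12))

def adapt_step(x, lr=0.5):
    # One gradient step on the surrogate objective (minimize routing entropy).
    # dH/dlogits = -p * (log p + H); only `delta` is updated, W stays frozen.
    global delta
    _, p = route(x)
    H = entropy(p)
    grad = -p * (np.log(p + 1e-12) + H)
    delta -= lr * grad

x = rng.normal(size=d)                # stand-in for a context representation
_, p0 = route(x)
for _ in range(20):                   # periodic adaptation phase
    adapt_step(x)
_, p1 = route(x)                      # generation then proceeds with this router
print(entropy(p0), entropy(p1))       # routing entropy drops after adaptation
```

In the actual method the adaptation vectors are applied only in selected layers and refreshed at intervals during generation, which keeps the update cheap and bounds how far routing can drift from the pretrained router.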