On Token's Dilemma: Dynamic MoE with Drift-Aware Token Assignment for Continual Learning of Large Vision Language Models
March 29, 2026
Authors: Chongyang Zhao, Mingsong Li, Haodong Lu, Dong Gong
cs.AI
Abstract
Multimodal Continual Instruction Tuning aims to continually enhance Large Vision Language Models (LVLMs) by learning from new data without forgetting previously acquired knowledge. Mixture of Experts (MoE) architectures naturally facilitate this by incrementally adding new experts and expanding routers while keeping the existing ones frozen. However, despite expert isolation, MoE-based continual learners still suffer from forgetting due to routing drift: old-task tokens are mistakenly attracted to newly added experts, degrading performance on prior tasks. We analyze this failure mode at the token level and reveal the token's dilemma: because their routing assignment is ambiguous during training, ambiguous and old tokens in new-task data offer minimal learning benefit yet induce forgetting when routed to new experts. Motivated by this, we propose LLaVA-DyMoE, a dynamic MoE framework that incrementally expands the MoE with drift-aware token assignment. We characterize token types via their routing-score distributions and apply targeted regularization. Specifically, token-level assignment guidance steers ambiguous and old tokens away from new experts to preserve established routing patterns and alleviate routing drift, while complementary routing-score regularizations enforce expert-group separation and promote new-expert specialization. Extensive experiments demonstrate that LLaVA-DyMoE effectively mitigates routing-drift-induced forgetting, achieving a gain of over 7% in mean final accuracy and a 12% reduction in forgetting compared to baselines. The project page is https://zhaoc5.github.io/DyMoE.
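To make the drift-aware token assignment concrete, below is a minimal PyTorch sketch of one plausible realization of the token-level guidance described in the abstract. The abstract does not specify how token types are detected or how the penalty is formed, so the entropy-based ambiguity criterion, the old-expert-mass criterion, the thresholds, and the function name `drift_aware_penalty` are all illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch (NOT the paper's code) of drift-aware token assignment:
# (1) flag "ambiguous" tokens by the entropy of their routing-score
#     distribution and "old" tokens by their routing mass on frozen experts;
# (2) penalize the routing mass such tokens place on newly added experts,
#     steering them back toward established routing patterns.
import torch
import torch.nn.functional as F

def drift_aware_penalty(router_logits: torch.Tensor,
                        num_old_experts: int,
                        entropy_thresh: float = 0.7,
                        old_mass_thresh: float = 0.6) -> torch.Tensor:
    """router_logits: (num_tokens, num_experts); the first num_old_experts
    columns correspond to the frozen, previously learned experts."""
    probs = F.softmax(router_logits, dim=-1)                  # routing scores
    # Normalized entropy in [0, 1]; near 1 means a flat, ambiguous routing.
    entropy = -(probs * probs.clamp_min(1e-9).log()).sum(-1)
    max_entropy = torch.log(torch.tensor(float(probs.size(-1))))
    ambiguous = entropy / max_entropy > entropy_thresh
    # Tokens whose routing mass already concentrates on old experts.
    old_mass = probs[:, :num_old_experts].sum(-1)
    old_token = old_mass > old_mass_thresh
    guided = ambiguous | old_token
    # Penalize routing mass that guided tokens assign to the new experts.
    new_mass = probs[:, num_old_experts:].sum(-1)
    return (new_mass * guided.float()).mean()

# Usage: add the penalty to the new-task loss during continual training.
logits = torch.randn(8, 6)  # 8 tokens; 4 old experts + 2 newly added experts
loss = drift_aware_penalty(logits, num_old_experts=4)
```

Under these assumptions, the penalty leaves confidently new-task tokens free to specialize the new experts, while ambiguous and old tokens are pushed back toward the frozen experts, which is the behavior the abstract attributes to the assignment guidance.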