

On Token's Dilemma: Dynamic MoE with Drift-Aware Token Assignment for Continual Learning of Large Vision Language Models

March 29, 2026
Authors: Chongyang Zhao, Mingsong Li, Haodong Lu, Dong Gong
cs.AI

Abstract

Multimodal Continual Instruction Tuning aims to continually enhance Large Vision Language Models (LVLMs) by learning from new data without forgetting previously acquired knowledge. Mixture of Experts (MoE) architectures naturally facilitate this by incrementally adding new experts and expanding routers while keeping the existing ones frozen. However, despite expert isolation, MoE-based continual learners still suffer from forgetting due to routing-drift: old-task tokens become mistakenly attracted to newly added experts, degrading performance on prior tasks. We analyze the failure mode at the token level and reveal the token's dilemma: ambiguous and old tokens in new-task data offer minimal learning benefit yet induce forgetting when routed to new experts, due to their ambiguous routing assignment during training. Motivated by this, we propose LLaVA-DyMoE, a dynamic MoE framework that incrementally expands the MoE with drift-aware token assignment. We characterize token types via their routing score distributions and apply targeted regularization. Specifically, a token-level assignment guidance steers ambiguous and old tokens away from new experts to preserve established routing patterns and alleviate routing-drift, while complementary routing score regularizations enforce expert-group separation and promote new-expert specialization. Extensive experiments demonstrate that our LLaVA-DyMoE effectively mitigates routing-drift-induced forgetting, achieving over a 7% gain in mean final accuracy and a 12% reduction in forgetting compared to baselines. The project page is https://zhaoc5.github.io/DyMoE.
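The abstract describes two routing-level regularizers: a token-level assignment guidance that steers ambiguous and old tokens away from new experts, and routing score regularizations that separate expert groups. The PyTorch sketch below illustrates one plausible form of these penalties, reconstructed from the abstract alone; the function name drift_aware_router_loss, the entropy-threshold rule for flagging ambiguous tokens, and all hyperparameters are our assumptions, not the paper's released implementation.

```python
# A minimal sketch of drift-aware token assignment for an expanded MoE router,
# based only on the abstract above. Names, thresholds, and loss forms are
# illustrative assumptions, not the paper's actual method.
import torch
import torch.nn.functional as F


def drift_aware_router_loss(logits: torch.Tensor, n_old: int,
                            tau: float = 0.5,
                            lam_assign: float = 1.0,
                            lam_sep: float = 0.1) -> torch.Tensor:
    """Auxiliary router loss for a dynamically expanded MoE.

    logits: [num_tokens, n_old + n_new] routing logits; the first n_old
            columns belong to frozen old-task experts, the rest to new ones.
    """
    probs = F.softmax(logits, dim=-1)            # routing score distribution
    old_mass = probs[:, :n_old].sum(dim=-1)      # mass placed on old experts
    new_mass = 1.0 - old_mass                    # mass placed on new experts

    # Characterize token types via the routing score distribution: tokens with
    # near-uniform scores are treated as "ambiguous", and tokens whose mass
    # concentrates on old experts behave like "old" tokens.
    entropy = -(probs * probs.clamp_min(1e-9).log()).sum(dim=-1)
    max_entropy = torch.log(torch.tensor(float(logits.size(-1))))
    ambiguous = (entropy / max_entropy) > tau    # assumed threshold rule
    old_like = old_mass > 0.5

    # (i) Token-level assignment guidance: penalize the routing mass that
    # ambiguous/old tokens place on the newly added experts, steering them
    # back toward the frozen experts to preserve established routing patterns.
    steer = (ambiguous | old_like).float()
    assign_loss = (steer * new_mass).mean()

    # (ii) Expert-group separation: old_mass * new_mass is zero only when a
    # token commits to one group, so minimizing it pushes each token toward
    # either the old or the new expert group exclusively.
    sep_loss = (old_mass * new_mass).mean()

    return lam_assign * assign_loss + lam_sep * sep_loss


# Usage: add the penalty to the new-task loss for a batch of token logits.
logits = torch.randn(8, 6, requires_grad=True)   # 8 tokens, 4 old + 2 new experts
loss = drift_aware_router_loss(logits, n_old=4)
loss.backward()
```

In the dynamic-MoE setting the abstract describes, such an auxiliary loss would be added to the task loss while the old experts and the old router columns stay frozen, so only the new experts and new router weights receive gradient updates.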