On Token's Dilemma: Dynamic MoE with Drift-Aware Token Assignment for Continual Learning of Large Vision Language Models

Abstract

Multimodal Continual Instruction Tuning aims to continually enhance Large Vision Language Models (LVLMs) by learning from new data without forgetting previously acquired knowledge. Mixture of Experts (MoE) architectures naturally support this by incrementally adding new experts and expanding routers while keeping the existing ones frozen. However, despite expert isolation, MoE-based continual learners still suffer from forgetting caused by routing drift: old-task tokens are mistakenly attracted to newly added experts, which degrades performance on prior tasks. We analyze this failure mode at the token level and reveal the token's dilemma: ambiguous and old tokens in new-task data offer minimal learning benefit, yet they induce forgetting when routed to new experts because their routing assignment is ambiguous during training. Motivated by this, we propose LLaVA-DyMoE, a dynamic MoE framework that incrementally expands the MoE with drift-aware token assignment. We characterize token types by their routing score distributions and apply targeted regularization. Specifically, a token-level assignment guidance steers ambiguous and old tokens away from new experts to preserve established routing patterns and alleviate routing drift, while complementary routing score regularizations enforce expert-group separation and promote new-expert specialization. Extensive experiments demonstrate that LLaVA-DyMoE effectively mitigates routing-drift-induced forgetting, achieving a gain of over 7% in mean final accuracy and a 12% reduction in forgetting compared to baselines. The project page is https://zhaoc5.github.io/DyMoE.
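
To make the idea of drift-aware token assignment more concrete, the following is a minimal sketch and not the authors' implementation. It assumes that the router emits one logit per expert, that the first `num_old` experts are the frozen old group, and that a token's type can be read off from how its routing probability mass splits between the old and new groups. The function names, the `margin` threshold, and the hard mask are all hypothetical; the abstract describes regularization applied during training, and the hard mask here is a simplified stand-in for that guidance.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify_token(logits, num_old, margin=0.2):
    """Label a token 'new', 'old', or 'ambiguous' from its routing scores.

    Assumption: the token type is decided by comparing the probability
    mass on old experts with the mass on new experts. `margin` is a
    hypothetical threshold, not a value from the paper.
    """
    probs = softmax(logits)
    old_mass = sum(probs[:num_old])
    new_mass = sum(probs[num_old:])
    if new_mass - old_mass > margin:
        return "new"
    if old_mass - new_mass > margin:
        return "old"
    return "ambiguous"


def guided_routing(logits, num_old, margin=0.2):
    """Return (token type, routing probabilities).

    Old and ambiguous tokens are kept away from new experts by masking
    the new-expert logits (a hard-mask simplification of the guidance).
    Requires num_old >= 1 so that at least one logit stays finite.
    """
    kind = classify_token(logits, num_old, margin)
    if kind == "new":
        return kind, softmax(logits)
    masked = list(logits[:num_old]) + [-math.inf] * (len(logits) - num_old)
    return kind, softmax(masked)


if __name__ == "__main__":
    # Two old experts followed by two newly added experts.
    for logits in ([2.0, 1.0, 0.0, 0.0], [0.0, 0.0, 3.0, 0.0], [1.0, 0.0, 1.0, 0.0]):
        kind, probs = guided_routing(logits, num_old=2)
        print(kind, [round(p, 3) for p in probs])
```

In this sketch only tokens that clearly prefer the new expert group are allowed to reach it, so the routing of old and ambiguous tokens stays within the old experts. A training-time variant would presumably replace the mask with a loss term on the routing scores, which is closer to the regularization the abstract mentions.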
