On Token's Dilemma: Dynamic MoE with Drift-Aware Token Assignment for Continual Learning of Large Vision Language Models

Abstract

Multimodal Continual Instruction Tuning aims to continually enhance Large Vision Language Models (LVLMs) by learning from new data without forgetting previously acquired knowledge. Mixture of Experts (MoE) architectures naturally facilitate this by incrementally adding new experts and expanding routers while keeping the existing ones frozen. However, despite expert isolation, MoE-based continual learners still suffer from forgetting due to routing-drift: old-task tokens become mistakenly attracted to newly added experts, degrading performance on prior tasks. We analyze the failure mode at the token level and reveal the token's dilemma: ambiguous and old tokens in new-task data offer minimal learning benefit yet induce forgetting when routed to new experts, due to their ambiguous routing assignment during training. Motivated by this, we propose LLaVA-DyMoE, a dynamic MoE framework that incrementally expands the MoE with drift-aware token assignment. We characterize token types via their routing score distributions and apply targeted regularization. Specifically, a token-level assignment guidance steers ambiguous and old tokens away from new experts to preserve established routing patterns and alleviate routing-drift, while complementary routing score regularizations enforce expert-group separation and promote new-expert specialization. Extensive experiments demonstrate that our LLaVA-DyMoE effectively mitigates routing-drift-induced forgetting, achieving over a 7% gain in mean final accuracy and a 12% reduction in forgetting compared to baselines. The project page is https://zhaoc5.github.io/DyMoE.
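
The idea of characterizing tokens by their routing score distribution and steering ambiguous and old tokens away from new experts can be illustrated with a toy sketch. This is not the LLaVA-DyMoE implementation: the function names, the `margin` threshold, the old-versus-new probability-mass rule, and the hard restriction to old experts are all assumptions made for illustration. The abstract describes guidance and regularization applied during training, which this sketch simplifies to a hard routing decision.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of router logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify_token(logits, num_old, margin=0.2):
    """Label a token as 'old', 'new', or 'ambiguous'.

    Hypothetical rule: compare the routing probability mass on the
    old (frozen) experts, indices [0, num_old), with the mass on the
    newly added experts. `margin` is an illustrative threshold.
    """
    probs = softmax(logits)
    old_mass = sum(probs[:num_old])
    new_mass = sum(probs[num_old:])
    if new_mass - old_mass > margin:
        return "new"
    if old_mass - new_mass > margin:
        return "old"
    return "ambiguous"


def guided_top_k(logits, num_old, k=1, margin=0.2):
    """Pick top-k experts, keeping old and ambiguous tokens on old experts.

    Only tokens labeled 'new' may be routed to the new experts; this
    hard mask is a simplification of the guidance described above.
    """
    label = classify_token(logits, num_old, margin)
    candidates = range(len(logits)) if label == "new" else range(num_old)
    ranked = sorted(candidates, key=lambda i: logits[i], reverse=True)
    return label, ranked[:k]


# Two old experts (indices 0, 1) and two new experts (indices 2, 3).
print(guided_top_k([0.1, 0.2, 3.0, 0.5], num_old=2))  # ('new', [2])
print(guided_top_k([3.0, 0.1, 0.2, 0.5], num_old=2))  # ('old', [0])
print(guided_top_k([1.0, 0.9, 1.1, 0.8], num_old=2))  # ('ambiguous', [0])
```

In the third call, plain top-1 routing would pick new expert 2, since it has the largest logit by a small margin. Under the assumed rule the token is labeled ambiguous and stays on old expert 0, which is the kind of routing-drift the abstract describes avoiding.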
