How to Teach Large Multimodal Models New Skills
October 9, 2025
Authors: Zhen Zhu, Yiming Gong, Yao Xiao, Yaoyao Liu, Derek Hoiem
cs.AI
Abstract
How can we teach large multimodal models (LMMs) new skills without erasing
prior abilities? We study sequential fine-tuning on five target skills while
monitoring general ability on eight held-out benchmarks across three model
families. We observe that apparent "forgetting" on held-out tasks after narrow
fine-tuning can partly recover at later stages. We trace this behavior to a
measurable shift in the output token distribution, manifested through a simple
counting-bias probe that co-varies with forgetting. Guided by this picture, we
identify two simple, robust tuning recipes that learn the target skills well while limiting
drift: (i) updating only the self-attention projection layers, and (ii)
updating only the MLP Gate&Up while freezing the Down projection. Across models
and tasks, these choices deliver strong target gains while largely preserving
held-out performance. Code is available at
https://github.com/jessemelpolio/LMM_CL
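
The two recipes amount to choosing which parameters receive gradients. The following is a minimal sketch of that selection, not the authors' implementation: it assumes LLaMA/Qwen-style parameter names (`self_attn.q_proj`, `mlp.gate_proj`, `mlp.down_proj`, and so on), and the recipe keys, the `scope` argument, and the helper names are hypothetical. The `model` argument is assumed to expose `named_parameters()` as a PyTorch `nn.Module` does.

```python
import re

# Assumption: LLaMA/Qwen-style parameter names. Other checkpoints may use
# different names, so inspect model.named_parameters() before relying on this.
RECIPES = {
    # Recipe (i): update only the self-attention projection layers.
    "sa_proj": re.compile(r"\.self_attn\.(q_proj|k_proj|v_proj|o_proj)\."),
    # Recipe (ii): update only the MLP gate and up projections; down stays frozen.
    "mlp_gate_up": re.compile(r"\.mlp\.(gate_proj|up_proj)\."),
}


def is_trainable(param_name, recipe, scope=""):
    """Return True if `param_name` should be updated under `recipe`.

    `scope` is a hypothetical substring filter (for example "language_model")
    that restricts the recipe to one part of the model.
    """
    return scope in param_name and RECIPES[recipe].search(param_name) is not None


def apply_recipe(model, recipe, scope=""):
    """Set requires_grad on every parameter and return the trainable names."""
    trainable = []
    for name, param in model.named_parameters():
        param.requires_grad = is_trainable(name, recipe, scope)
        if param.requires_grad:
            trainable.append(name)
    return trainable
```

Calling `apply_recipe(model, "mlp_gate_up")` before building the optimizer would leave every parameter frozen except the gate and up projections. Whether the vision encoder and projector should also be covered by `scope` depends on the model family and is not specified here.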