How to Teach Large Multimodal Models New Skills

Abstract

How can we teach large multimodal models (LMMs) new skills without erasing prior abilities? We study sequential fine-tuning on five target skills across three model families, monitoring general ability on eight held-out benchmarks. We observe that apparent "forgetting" on held-out tasks after narrow fine-tuning can partly recover at later stages. We trace this behavior to a measurable shift in the output token distribution, which shows up in a simple counting-bias probe that co-varies with forgetting. Guided by this picture, we identify two simple, robust tuning recipes that learn strongly while limiting drift: (i) updating only the self-attention projection layers, and (ii) updating only the MLP gate and up projections (Gate&Up) while freezing the Down projection. Across models and tasks, these choices deliver strong target gains while largely preserving held-out performance. Code is available at https://github.com/jessemelpolio/LMM_CL
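
The two tuning recipes amount to choosing which parameters receive gradients. The following is a minimal sketch of that selection, not the authors' implementation. It assumes LLaMA-style parameter names (`q_proj`, `k_proj`, `v_proj`, `o_proj`, `gate_proj`, `up_proj`, `down_proj`), which may differ in other architectures. The helper names `is_trainable` and `apply_recipe` and the `prefix` argument are hypothetical, introduced only for illustration.

```python
from typing import List

# Assumed LLaMA-style name fragments; check your model's actual parameter names.
RECIPES = {
    "attn_proj": ("q_proj", "k_proj", "v_proj", "o_proj"),  # recipe (i)
    "mlp_gate_up": ("gate_proj", "up_proj"),                # recipe (ii); down_proj stays frozen
}


def is_trainable(param_name: str, recipe: str, prefix: str = "") -> bool:
    """Return True if the named parameter should be updated under the recipe.

    `prefix` (hypothetical) restricts updates to one submodule, for example
    the language model rather than the vision encoder.
    """
    if not param_name.startswith(prefix):
        return False
    parts = param_name.split(".")
    return any(part in RECIPES[recipe] for part in parts)


def apply_recipe(model, recipe: str, prefix: str = "") -> List[str]:
    """Set requires_grad on every parameter and return the trainable names.

    Works with any object exposing named_parameters() that yields
    (name, parameter) pairs, such as a torch.nn.Module.
    """
    trainable = []
    for name, param in model.named_parameters():
        param.requires_grad = is_trainable(name, recipe, prefix)
        if param.requires_grad:
            trainable.append(name)
    return trainable
```

With a PyTorch model, calling `apply_recipe(model, "mlp_gate_up")` before building the optimizer would leave only the gate and up projections trainable. Passing just the parameters with `requires_grad=True` to the optimizer keeps frozen weights out of its state.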
