

What Matters for Model Merging at Scale?

October 4, 2024
作者: Prateek Yadav, Tu Vu, Jonathan Lai, Alexandra Chronopoulou, Manaal Faruqui, Mohit Bansal, Tsendsuren Munkhdalai
cs.AI

Abstract

Model merging aims to combine multiple expert models into a more capable single model, offering benefits such as reduced storage and serving costs, improved generalization, and support for decentralized model development. Despite its promise, previous studies have primarily focused on merging a few small models. This leaves many unanswered questions about the effect of scaling model size and how it interplays with other key factors -- like the base model quality and the number of expert models -- to affect the merged model's performance. This work systematically evaluates the utility of model merging at scale, examining the impact of these different factors. We experiment with merging fully fine-tuned models using 4 popular merging methods -- Averaging, Task Arithmetic, Dare, and TIES -- across model sizes ranging from 1B to 64B parameters and merging up to 8 different expert models. We evaluate the merged models on both held-in tasks, i.e., the experts' training tasks, and zero-shot generalization to unseen held-out tasks. Our experiments provide several new insights about model merging at scale and the interplay between different factors. First, we find that merging is more effective when experts are created from strong base models, i.e., models with good zero-shot performance. Second, larger models facilitate easier merging. Third, merging consistently improves generalization capabilities. Notably, when merging 8 large expert models, the merged models often generalize better than the multitask-trained models. Fourth, we can merge more expert models more effectively when working with larger models. Fifth, different merging methods behave very similarly at larger scales. Overall, our findings shed light on some interesting properties of model merging while also highlighting some limitations. We hope that this study will serve as a reference point on large-scale merging for upcoming research.
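For readers unfamiliar with the merging methods named in the abstract, the sketch below illustrates two of them, simple Averaging and Task Arithmetic, operating on PyTorch state dicts. This is a minimal illustration under the assumption that every expert was fine-tuned from the same base model; the function names and the scaling coefficient lam are illustrative and not the paper's exact configuration.

import torch

def average_merge(expert_state_dicts):
    # Simple Averaging: element-wise mean of the experts' parameters.
    merged = {}
    for name in expert_state_dicts[0]:
        merged[name] = torch.stack(
            [sd[name].float() for sd in expert_state_dicts]
        ).mean(dim=0)
    return merged

def task_arithmetic_merge(base_state_dict, expert_state_dicts, lam=0.3):
    # Task Arithmetic: sum the task vectors (expert minus base), scale the
    # sum by lam, and add it back onto the base model's parameters.
    merged = {}
    for name, base_param in base_state_dict.items():
        base_param = base_param.float()
        task_vector_sum = sum(
            sd[name].float() - base_param for sd in expert_state_dicts
        )
        merged[name] = base_param + lam * task_vector_sum
    return merged

The returned dictionary can be loaded back into a model of the same architecture with load_state_dict; Dare and TIES follow the same task-vector idea but additionally sparsify and resolve sign conflicts among the task vectors before merging.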
