What Matters for Model Merging at Scale?

Abstract

Model merging aims to combine multiple expert models into a single, more capable model, offering benefits such as reduced storage and serving costs, improved generalization, and support for decentralized model development. Despite its promise, previous studies have primarily focused on merging a few small models. This leaves many unanswered questions about the effect of scaling model size and how it interacts with other key factors, such as base model quality and the number of expert models, to affect the merged model's performance. This work systematically evaluates the utility of model merging at scale, examining the impact of these different factors. We experiment with merging fully fine-tuned models using four popular merging methods (Averaging, Task Arithmetic, Dare, and TIES) across model sizes ranging from 1B to 64B parameters, merging up to 8 different expert models. We evaluate the merged models both on held-in tasks, i.e., the experts' training tasks, and on zero-shot generalization to unseen held-out tasks. Our experiments provide several new insights about model merging at scale and the interplay between different factors. First, we find that merging is more effective when experts are created from strong base models, i.e., models with good zero-shot performance. Second, larger models facilitate easier merging. Third, merging consistently improves generalization capabilities. Notably, when merging 8 large expert models, the merged models often generalize better than multitask-trained models. Fourth, we can better merge more expert models when working with larger models. Fifth, different merging methods behave very similarly at larger scales. Overall, our findings shed light on some interesting properties of model merging while also highlighting some limitations. We hope that this study will serve as a reference point on large-scale merging for upcoming research.
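The abstract names the merging methods without defining them. As a rough illustration only, the sketch below shows the two simplest ones under their commonly used definitions: Averaging takes the parameter-wise mean of the experts, and Task Arithmetic adds a scaled sum of task vectors (expert minus base) to the base model. The toy dict-of-lists parameter format, the function names, and the `scale` parameter are assumptions made for this example, not the paper's code. Dare and TIES, which add further steps, are not shown.

```python
from typing import Dict, List

# Toy stand-in for a model's named weight tensors (assumption for illustration).
Params = Dict[str, List[float]]


def average_merge(experts: List[Params]) -> Params:
    """Parameter-wise mean of the expert models."""
    n = len(experts)
    return {
        name: [sum(vals) / n for vals in zip(*(e[name] for e in experts))]
        for name in experts[0]
    }


def task_arithmetic_merge(
    base: Params, experts: List[Params], scale: float = 1.0
) -> Params:
    """Base weights plus a scaled sum of task vectors (expert - base)."""
    merged: Params = {}
    for name, base_vals in base.items():
        merged[name] = [
            b + scale * sum(e[name][i] - b for e in experts)
            for i, b in enumerate(base_vals)
        ]
    return merged


# Two hypothetical experts fine-tuned from the same base model.
base = {"w": [0.0, 1.0]}
experts = [{"w": [1.0, 1.0]}, {"w": [0.0, 3.0]}]

avg = average_merge(experts)
ta = task_arithmetic_merge(base, experts, scale=0.5)
print(avg, ta)
```

With `scale` set to one over the number of experts, as in the usage above, Task Arithmetic reduces to plain averaging. Other values of `scale` move the merged weights further from or closer to the base model.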
