If You Can't Use Them, Recycle Them: Optimizing Merging at Scale Mitigates Performance Tradeoffs

Abstract

Model merging has shown great promise at combining expert models, but its benefit is unclear when merging "generalist" models trained on many tasks. We explore merging in the context of large (~100B) models by recycling checkpoints that exhibit tradeoffs among different tasks. Such checkpoints are often created in the process of developing a frontier model, and many suboptimal ones are usually discarded. Given a pool of model checkpoints obtained from different training runs (e.g., different stages, objectives, hyperparameters, and data mixtures), which naturally show tradeoffs across different language capabilities (e.g., instruction following vs. code generation), we investigate whether merging can recycle such suboptimal models into a Pareto-optimal one. Our optimization algorithm tunes the weight of each checkpoint in a linear combination, resulting in Pareto-optimal models that outperform both individual models and merge-based baselines. Further analysis shows that good merges tend to assign non-zero weights to almost all checkpoints, indicating that even seemingly bad initial checkpoints can contribute to good final merges.
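
The idea of tuning per-checkpoint weights in a linear combination can be illustrated with a small sketch. The abstract does not name the optimizer, so the plain random search below is a stand-in chosen for brevity, not the authors' method. The function names (merge_checkpoints, pareto_front, random_search), the representation of a checkpoint as a dict of flat float lists, and the evaluate callback are all hypothetical.

```python
import random

def merge_checkpoints(checkpoints, weights):
    """Weighted sum of checkpoints; each checkpoint maps a parameter name to a list of floats."""
    merged = {}
    for name in checkpoints[0]:
        size = len(checkpoints[0][name])
        merged[name] = [
            sum(w * ckpt[name][i] for w, ckpt in zip(weights, checkpoints))
            for i in range(size)
        ]
    return merged

def normalize(weights):
    """Scale weights so they sum to 1 (an assumption; other constraints are possible)."""
    total = sum(weights)
    return [w / total for w in weights]

def dominates(a, b):
    """True if score tuple a is at least as good as b on every task and better on one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def pareto_front(candidates):
    """Keep the (weights, scores) pairs that no other candidate dominates."""
    return [c for c in candidates
            if not any(dominates(other[1], c[1]) for other in candidates)]

def random_search(checkpoints, evaluate, trials=50, seed=0):
    """Stand-in optimizer: sample weight vectors, score each merge, return the Pareto front.

    evaluate is a hypothetical callback returning one score per task (higher is better).
    """
    rng = random.Random(seed)
    candidates = []
    for _ in range(trials):
        weights = normalize([rng.random() for _ in checkpoints])
        scores = evaluate(merge_checkpoints(checkpoints, weights))
        candidates.append((weights, scores))
    return pareto_front(candidates)
```

In this sketch each task score plays the role of one capability (for example, instruction following or code generation), and the returned front contains the weight vectors whose merges are not beaten on every task by another sampled merge. A real setup would replace the toy parameter lists with model tensors and the callback with benchmark evaluations.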

