Why Larger Models Learn More: The Effects of Capacity, Interference, and Rare-Task Retention

Abstract

Larger models learn tasks that smaller models do not. What drives this phenomenon? We develop a simple phenomenological argument: power-law scaling already suggests that a larger model can learn a part of the data distribution that a smaller model fails to learn, even with infinite training data. To validate this claim and identify its causes, we study the effects of model scaling in a synthetic setup consisting of a mixture of tasks that show monotonic scaling curves. The results point to a data-induced competition over resources (neurons). Specifically, smaller models allocate their neurons to high-frequency or low-complexity tasks, and so learn solutions that perform poorly on rare and complex tasks. Moreover, this happens even when solutions capable of expressing the desired task exist. We then assess how a larger model circumvents this data-centric bottleneck and find that it traces to a reduced-interference mechanism: larger models can allocate enough resources to common tasks that the gradient updates for those tasks become weak, so they do not overwrite rare-task features as those features slowly accumulate. Finally, to further validate these claims, we pretrain OLMo models (4M to 4B parameters) on novel tasks of varying frequency and complexity. The results mirror those from our synthetic data experiments: only the larger OLMo models learn the infrequent and complex tasks, and these larger models embed more task features in their representations and show less gradient interference between tasks. Overall, we offer a data-centric account of why larger models learn tasks that smaller models fail to learn. This helps explain why larger models are better in practice, and it can inform practical questions concerning model sizing and training data mixtures.
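
The reduced-interference mechanism can be illustrated with a minimal sketch. It is not the measurement used in this work, which the abstract does not specify. It assumes that interference between two tasks can be summarized by (a) the cosine similarity of their loss gradients and (b) the first-order Taylor estimate of how one SGD step on the common task changes the rare-task loss, which is -lr * <g_rare, g_common>. The function names, the toy gradient vectors, and the learning rate are all hypothetical.

```python
import numpy as np


def cosine_similarity(g_a, g_b, eps=1e-12):
    """Cosine of the angle between two gradient vectors (negative = conflicting)."""
    g_a = np.asarray(g_a, dtype=float).ravel()
    g_b = np.asarray(g_b, dtype=float).ravel()
    denom = np.linalg.norm(g_a) * np.linalg.norm(g_b)
    return float(g_a @ g_b / max(denom, eps))


def first_order_rare_loss_change(g_common, g_rare, lr):
    """First-order estimate of the change in rare-task loss after one SGD step
    theta <- theta - lr * g_common taken on the common task only."""
    g_common = np.asarray(g_common, dtype=float).ravel()
    g_rare = np.asarray(g_rare, dtype=float).ravel()
    return float(-lr * (g_rare @ g_common))


# Hypothetical toy gradients (illustrative values, not taken from any experiment).
lr = 0.1
g_rare = np.array([1.0, 0.0])
g_common_strong = np.array([-2.0, 1.0])  # common task still poorly fit
g_common_weak = 0.1 * g_common_strong    # same direction, common task well fit

for name, g_common in [("strong", g_common_strong), ("weak", g_common_weak)]:
    cos = cosine_similarity(g_rare, g_common)
    delta = first_order_rare_loss_change(g_common, g_rare, lr)
    print(f"{name}: cosine={cos:.3f}, estimated rare-loss change={delta:+.3f}")
```

In this toy case the two common-task gradients point in the same direction, so their cosine similarity with the rare-task gradient is identical. Only the magnitude differs, and the weaker gradient damages the rare task proportionally less per step. This is one way to read the claim that weak common-task updates stop overwriting rare-task features.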