Why Larger Models Learn More: The Effects of Capacity, Interference, and Rare-Task Retention

Abstract

Larger models learn tasks that smaller models do not. What drives this phenomenon? We develop a simple phenomenological argument: power-law scaling already suggests that a larger model can learn a part of the data distribution that a smaller model fails to learn, even with infinite training data. To validate this claim and identify its causes, we study the effects of model scaling in a synthetic setup consisting of a mixture of tasks that show monotonic scaling curves. The results point to a data-induced competition over resources (neurons). Specifically, smaller models allocate their neurons to high-frequency or low-complexity tasks, so the solutions they learn perform poorly on rare and complex tasks. Moreover, this happens even when solutions capable of expressing the desired task exist. We then assess how a larger model circumvents this data-centric bottleneck and find that it traces to a reduced-interference mechanism: larger models can allocate enough resources to common tasks that the gradient updates for those tasks become weak, so they do not overwrite rare-task features as these slowly accumulate. Finally, to further validate these claims, we pretrain OLMo models (4M to 4B parameters) on novel tasks of varying frequency and complexity. The results mirror those from our synthetic data experiments: only the larger OLMo models learn the infrequent and complex tasks, and these larger models embed more task features in their representations and show less gradient interference between tasks. Overall, we offer a data-centric account of why larger models learn tasks that smaller models fail to learn. This helps explain why larger models perform better in practice, and it can inform practical questions about model sizing and training data mixtures.
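One way to see how a power-law scaling curve can coexist with tasks that a smaller model never learns is the toy calculation below. It is a minimal sketch under assumptions that are not taken from the paper: task frequencies follow a Zipf-like law with a hypothetical exponent `alpha + 1`, and a model with a given `capacity` learns exactly the `capacity` most frequent tasks and nothing else. The function name `residual_mass` and all parameter values are illustrative only.

```python
import numpy as np

def residual_mass(capacity, num_tasks=100_000, alpha=0.5):
    """Probability mass of the tasks left unlearned when only the
    `capacity` most frequent tasks are learned (toy assumption)."""
    ranks = np.arange(1, num_tasks + 1, dtype=float)
    freqs = ranks ** -(alpha + 1.0)   # assumed Zipf-like task frequencies
    freqs /= freqs.sum()              # normalize to a probability distribution
    return freqs[capacity:].sum()     # mass of the tasks beyond capacity

# Hypothetical model sizes, measured in "number of tasks that fit".
capacities = np.array([10, 100, 1000])
losses = np.array([residual_mass(c) for c in capacities])

# Slope of log(loss) against log(capacity); close to -alpha in this toy model.
slope = np.polyfit(np.log(capacities), np.log(losses), 1)[0]
```

In this toy model the unlearned mass shrinks roughly as a power of capacity, yet for any finite capacity the rarest tasks stay unlearned no matter how much data is seen, because the limit is set by capacity rather than by sample count.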