Why Larger Models Learn More: The Effects of Capacity, Interference, and Rare-Task Retention

Abstract

Larger models learn tasks that smaller models do not. What drives this phenomenon? We develop a simple phenomenological argument: power-law scaling already suggests that a larger model can learn a part of the data distribution that a smaller model fails to learn, even with infinite training data. To validate this claim and identify its causes, we study the effects of model scaling in a synthetic setup consisting of a mixture of tasks that show monotonic scaling curves. The results point to a data-induced competition over resources (neurons). Specifically, smaller models allocate their neurons to high-frequency or low-complexity tasks, so the solutions they learn perform poorly on rare and complex tasks. Moreover, this happens even when solutions capable of expressing the desired task exist. We then assess how a larger model circumvents this data-centric bottleneck and find that it traces to a reduced-interference mechanism: larger models can allocate enough resources to common tasks that the gradient updates for those tasks become weak, so those updates do not overwrite rare-task features as these features slowly accumulate. Finally, to further validate these claims, we pretrain OLMo models (4M to 4B parameters) on novel tasks of varying frequency and complexity. The results mirror those from our synthetic data experiments: only the larger OLMo models learn the infrequent and complex tasks, and these larger models embed more task features in their representations and show less gradient interference between tasks. Overall, we offer a data-centric account of why larger models learn tasks that smaller models fail to learn. This account helps explain why larger models perform better in practice, and it can inform practical questions about model sizing and training data mixtures.