Why Larger Models Learn More: The Effects of Capacity, Interference, and Rare-Task Retention

Abstract

Larger models learn tasks that smaller models do not. What drives this phenomenon? We develop a simple phenomenological argument that power-law scaling already implies that a larger model can learn a part of the data distribution that a smaller model fails to learn, even with infinite training data. To validate this claim and identify its causes, we study the effects of model scaling in a synthetic setup consisting of a mixture of tasks that show monotonic scaling curves. The results point to a data-induced competition over resources (neurons). Specifically, smaller models allocate their neurons to high-frequency or low-complexity tasks, and so they learn solutions that perform poorly on rare and complex tasks. Moreover, this happens even when solutions capable of expressing the desired task exist. We then assess how a larger model circumvents this data-centric bottleneck and find that it traces to a reduced-interference mechanism: larger models can allocate enough resources to common tasks that the gradient updates for those tasks become weak, so they do not overwrite rare-task features as those features slowly accumulate. Finally, to further validate these claims, we pretrain OLMo models (4M to 4B parameters) on novel tasks of varying frequency and complexity. The results mirror those from our synthetic data experiments: only the larger OLMo models learn the infrequent and complex tasks, and these larger models embed more task features in their representations and show less gradient interference between tasks. Overall, we offer a data-centric account of why larger models learn tasks that smaller models fail to learn. This helps explain why larger models are better in practice, and it can inform practical questions concerning model sizing and training data mixtures.
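The power-law argument can be illustrated with a toy calculation. The sketch below is not the paper's derivation; it rests on three assumptions of ours that the abstract does not state: task frequencies follow a Zipf law p_k proportional to k^-(1 + alpha), a model of capacity n learns exactly its n most frequent tasks, and every unlearned task contributes unit loss no matter how much data is available. The function names (`zipf_task_frequencies`, `residual_loss`, `log_log_slope`) are hypothetical. Under these assumptions the residual loss falls as a power law in capacity, and the tasks beyond rank n stay unlearned even with infinite data.

```python
import math


def zipf_task_frequencies(num_tasks, alpha):
    """Normalized task frequencies p_k proportional to k^-(1 + alpha) (assumed)."""
    weights = [k ** -(1.0 + alpha) for k in range(1, num_tasks + 1)]
    total = math.fsum(weights)
    return [w / total for w in weights]


def residual_loss(frequencies, capacity):
    """Probability mass of the tasks left unlearned at a given capacity.

    Toy assumption: the model learns exactly its `capacity` most frequent
    tasks and incurs unit loss on all others, independent of data size.
    """
    return math.fsum(frequencies[capacity:])


def log_log_slope(xs, ys):
    """Least-squares slope of log(y) against log(x)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    mean_x = sum(lx) / len(lx)
    mean_y = sum(ly) / len(ly)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(lx, ly))
    var = sum((a - mean_x) ** 2 for a in lx)
    return cov / var


alpha = 0.5  # illustrative exponent, not a value from the paper
freqs = zipf_task_frequencies(num_tasks=1_000_000, alpha=alpha)
capacities = [10, 30, 100, 300, 1000]
losses = [residual_loss(freqs, c) for c in capacities]
slope = log_log_slope(capacities, losses)

for c, loss in zip(capacities, losses):
    print(f"capacity={c:5d}  residual loss={loss:.4f}")
print(f"fitted log-log slope: {slope:.3f} (compare with -alpha = {-alpha})")
```

In this toy, each increase in capacity recovers a band of rarer tasks that the smaller model never learns, and the fitted slope lands close to -alpha. It omits task complexity and gradient interference entirely, so it only illustrates the frequency side of the argument.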