
Where to find Grokking in LLM Pretraining? Monitor Memorization-to-Generalization without Test

June 26, 2025
作者: Ziyue Li, Chenrui Fan, Tianyi Zhou
cs.AI

摘要

Grokking, i.e., the phenomenon in which test performance keeps improving long after the training loss has converged, has recently been observed in neural network training, making the mechanism behind generalization and other emergent capabilities, such as reasoning, mysterious. While prior studies usually train small models on a few toy or highly specific tasks for thousands of epochs, we conduct the first study of grokking on checkpoints taken during the one-pass pretraining of a 7B large language model (LLM), OLMoE. We compute the training loss and evaluate generalization on diverse benchmark tasks, including math reasoning, code generation, and commonsense/domain-specific knowledge retrieval. Our study verifies, for the first time, that grokking still happens in the pretraining of large-scale foundation models, though different data may enter their grokking stages asynchronously. We further demystify grokking's "emergence of generalization" by investigating the LLM's internal dynamics. Specifically, we find that training samples' pathways (i.e., their expert choices across layers) evolve during grokking from random and instance-specific to more structured and shareable between samples. Moreover, the complexity of a sample's pathway decreases even though the loss has already converged. These findings indicate a memorization-to-generalization transition, providing a mechanistic explanation for delayed generalization. Building on this, we develop two novel metrics that quantify the distance between pathways and the complexity of a single pathway, and show that they predict generalization improvements on diverse downstream tasks. Both metrics are efficient, simple to compute, and depend solely on training data, so they have practical value for pretraining: they let us monitor generalization without finetuning or test sets. Theoretically, we show that more structured pathways reduce model complexity and improve the generalization bound.
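The two metrics described above can be illustrated with a minimal sketch. The abstract does not give the paper's exact formulas, so the following is an assumption-laden illustration: pathways are represented as the top-k expert indices chosen at each MoE layer, pathway distance is taken as one minus the mean per-layer Jaccard overlap of expert sets, and single-pathway complexity is taken as the Shannon entropy of expert-usage frequencies along the pathway (more structured routing gives lower entropy). The function names and representations are hypothetical, not the authors'.

```python
import numpy as np

def pathway_distance(p1: np.ndarray, p2: np.ndarray) -> float:
    """Distance between two samples' expert-routing pathways.

    p1, p2: int arrays of shape (num_layers, top_k) holding the expert
    indices selected at each MoE layer. Illustrative definition:
    1 - mean per-layer Jaccard overlap of the chosen expert sets.
    0.0 = identical routing at every layer; 1.0 = fully disjoint.
    """
    dists = []
    for e1, e2 in zip(p1, p2):
        s1, s2 = set(e1.tolist()), set(e2.tolist())
        overlap = len(s1 & s2) / len(s1 | s2)  # Jaccard similarity
        dists.append(1.0 - overlap)
    return float(np.mean(dists))

def pathway_complexity(pathway: np.ndarray, num_experts: int) -> float:
    """Complexity of a single sample's pathway.

    Illustrative definition: Shannon entropy (bits) of the expert-usage
    frequencies across all layers. A pathway that reuses few experts is
    "structured" (low entropy); one spread over many experts is not.
    """
    counts = np.bincount(pathway.ravel(), minlength=num_experts)
    probs = counts / counts.sum()
    probs = probs[probs > 0]  # drop unused experts (0 * log 0 := 0)
    return float(-(probs * np.log2(probs)).sum())
```

In this sketch, a memorization-to-generalization transition would show up as pairwise `pathway_distance` shrinking between training samples (routing becomes shareable) and `pathway_complexity` dropping per sample (routing becomes structured), both computable from training data alone, with no test set required.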