Where to find Grokking in LLM Pretraining? Monitor Memorization-to-Generalization without Test
June 26, 2025
Authors: Ziyue Li, Chenrui Fan, Tianyi Zhou
cs.AI
Abstract
Grokking, i.e., the phenomenon that test performance keeps improving long after the training loss has converged, has recently been observed in neural network training, making the mechanism behind generalization and other emergent capabilities such as reasoning appear mysterious. While prior studies usually train small models on a few toy or highly specific tasks for thousands of epochs, we conduct the first study of grokking on checkpoints from the one-pass pretraining of a 7B large language model (LLM), OLMoE. We compute the training loss and evaluate generalization on diverse benchmark tasks, including math reasoning, code generation, and commonsense/domain-specific knowledge retrieval.
Our study verifies, for the first time, that grokking still happens during the pretraining of large-scale foundation models, although different data may enter the grokking stage asynchronously. We further demystify grokking's "emergence of generalization" by investigating the LLM's internal dynamics. Specifically, we find that training samples' pathways (i.e., their expert choices across layers) evolve from random and instance-specific to more structured and shareable across samples during grokking. Moreover, the complexity of each sample's pathway decreases even though the loss has already converged. These findings indicate a memorization-to-generalization conversion, providing a mechanistic explanation of delayed generalization.
In this study, we develop two novel metrics that quantify the pathway distance between samples and the complexity of a single pathway, and we show that they predict generalization improvements on diverse downstream tasks. Both metrics are efficient, simple to compute, and depend solely on the training data; hence, they are of practical value for pretraining, enabling us to monitor generalization performance without finetuning or test evaluation. Theoretically, we show that more structured pathways reduce model complexity and improve the generalization bound.
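The two metrics are only named here, not defined. One plausible instantiation, given only the abstract, is to measure pathway distance as the mean per-layer Jaccard distance between two samples' selected-expert sets, and pathway complexity as the average entropy of a sample's per-layer expert-usage distribution. The sketch below is an assumption-based illustration with hypothetical function names, not the paper's actual formulas.

```python
import torch

def pathway_distance(p1: torch.Tensor, p2: torch.Tensor) -> float:
    """Illustrative distance between two boolean pathways of shape
    (num_layers, num_experts): mean per-layer Jaccard distance."""
    inter = (p1 & p2).sum(dim=-1).float()
    union = (p1 | p2).sum(dim=-1).float().clamp(min=1.0)
    return (1.0 - inter / union).mean().item()

def pathway_complexity(router_logits: torch.Tensor) -> float:
    """Illustrative complexity of one sample's pathway: entropy of the
    token-averaged routing distribution, averaged over layers."""
    probs = router_logits.softmax(dim=-1).mean(dim=1)          # (L, E)
    entropy = -(probs * probs.clamp(min=1e-12).log()).sum(dim=-1)
    return entropy.mean().item()

# Toy usage: random boolean pathways and random router logits
p1 = torch.rand(4, 16) > 0.7
p2 = torch.rand(4, 16) > 0.7
print(pathway_distance(p1, p2), pathway_complexity(torch.randn(4, 5, 16)))
```

If metrics along these lines are tracked over pretraining checkpoints, a falling average pathway distance across training samples together with falling per-sample complexity would be the kind of training-data-only signal the paper proposes for monitoring the memorization-to-generalization transition without running test evaluations.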