Where to find Grokking in LLM Pretraining? Monitor Memorization-to-Generalization without Test

Abstract

Grokking, i.e., test performance that keeps improving long after the training loss has converged, has recently been observed in neural network training, leaving the mechanism of generalization and of emerging capabilities such as reasoning poorly understood. While prior studies usually train small models on a few toy or highly specific tasks for thousands of epochs, we conduct the first study of grokking on checkpoints from the one-pass pretraining of a 7B large language model (LLM), namely OLMoE. We compute the training loss and evaluate generalization on diverse benchmark tasks, including math reasoning, code generation, and commonsense/domain-specific knowledge retrieval. Our study verifies, for the first time, that grokking still happens in the pretraining of large-scale foundation models, though different data may enter grokking stages asynchronously. We further demystify grokking's "emergence of generalization" by investigating LLM internal dynamics. Specifically, we find that training samples' pathways (i.e., expert choices across layers) evolve during grokking from random and instance-specific to more structured and shareable between samples. Also, the complexity of a sample's pathway decreases even though the loss has already converged. These findings indicate a memorization-to-generalization conversion, providing a mechanistic explanation of delayed generalization. We develop two novel metrics that quantify pathway distance and the complexity of a single pathway, and we show that they predict the generalization improvement on diverse downstream tasks. They are efficient, simple to compute, and depend solely on training data. Hence, they have practical value for pretraining, enabling us to monitor generalization performance without finetuning or testing. Theoretically, we show that more structured pathways reduce model complexity and improve the generalization bound.
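
The abstract names two pathway metrics without defining them. The sketch below is only an illustration of what such quantities could look like, not the paper's definitions. It assumes top-1 routing, so that a pathway is one expert index per layer; it uses a Hamming distance as a stand-in for pathway distance, and one minus the mean cosine similarity of consecutive layers' routing weights as a stand-in for single-pathway complexity. All function names are hypothetical.

```python
import math
from itertools import combinations

# ASSUMPTION: a pathway is one expert index per MoE layer (top-1 routing).
# The source only says "expert choices across layers".


def pathway_distance(a, b):
    """Hamming-style stand-in: share of layers routed to different experts."""
    if len(a) != len(b) or not a:
        raise ValueError("pathways must be non-empty and cover the same layers")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def mean_pairwise_distance(pathways):
    """Average pathway_distance over all unordered pairs of samples."""
    pairs = list(combinations(pathways, 2))
    if not pairs:
        raise ValueError("need at least two pathways")
    return sum(pathway_distance(a, b) for a, b in pairs) / len(pairs)


def _cosine(u, v):
    """Cosine similarity of two equal-length, non-zero vectors."""
    if len(u) != len(v):
        raise ValueError("routing weight vectors must have equal length")
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(x * x for x in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("routing weights must not be all zero")
    return sum(x * y for x, y in zip(u, v)) / (norm_u * norm_v)


def pathway_complexity(routing_weights):
    """Stand-in complexity of one sample's pathway.

    ASSUMPTION: routing_weights holds one weight vector per layer, all of the
    same length. Returns 1 - mean cosine similarity between consecutive
    layers, so smoother routing across layers gives a lower value.
    """
    if len(routing_weights) < 2:
        return 0.0
    sims = [_cosine(u, v) for u, v in zip(routing_weights, routing_weights[1:])]
    return 1.0 - sum(sims) / len(sims)


if __name__ == "__main__":
    # Toy data: three samples routed through four layers.
    paths = [[0, 1, 2, 3], [0, 1, 2, 3], [3, 1, 0, 3]]
    print(mean_pairwise_distance(paths))
    print(pathway_complexity([[1.0, 0.0], [0.0, 1.0]]))
```

Under these assumptions, a training run moving from memorization to generalization would show both numbers falling across checkpoints while the training loss stays flat. Both are computed from routing decisions on training samples alone, which is the property the abstract emphasizes.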
