What do Language Models Learn and When? The Implicit Curriculum Hypothesis
April 9, 2026
Authors: Emmy Liu, Kaiser Sun, Millicent Li, Isabelle Lee, Lindia Tjuatja, Jen-tse Huang, Graham Neubig
cs.AI
Abstract
Large language models (LLMs) can perform remarkably complex tasks, yet the fine-grained details of how these capabilities emerge during pretraining remain poorly understood. Scaling laws on validation loss tell us how much a model improves with additional compute, but not which skills it acquires in what order. To remedy this, we propose the Implicit Curriculum Hypothesis: pretraining follows a compositional and predictable curriculum across models and data mixtures. We test this by designing a suite of simple, composable tasks spanning retrieval, morphological transformations, coreference, logical reasoning, and mathematics. Using these tasks, we track emergence points across four model families ranging from 410M to 13B parameters. We find that the order in which models reach fixed accuracy thresholds across tasks is strikingly consistent (ρ = 0.81 across 45 model pairs), and that composite tasks most often emerge after their component tasks. Furthermore, we find that this structure is encoded in model representations: tasks with similar function vector representations also tend to follow similar trajectories during training. Using the representation space derived from our task set, we can effectively predict the training trajectories of simple held-out compositional tasks throughout pretraining (R² = 0.68-0.84 across models) without evaluating them beforehand. Together, these results suggest that pretraining is more structured than loss curves reveal: skills emerge in a compositional order that is consistent across models and readable from their internals.
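The consistency claim can be illustrated with a minimal sketch: for each model, record the first checkpoint at which accuracy on each task crosses a fixed threshold, then compare two models' emergence orderings via Spearman rank correlation. This is an assumption-laden illustration, not the authors' code; the accuracy curves, task names, and threshold below are hypothetical.

```python
def emergence_step(accuracy_curve, threshold=0.5):
    """Return the first checkpoint index whose accuracy reaches the threshold."""
    for step, acc in enumerate(accuracy_curve):
        if acc >= threshold:
            return step
    return len(accuracy_curve)  # never emerged: rank after all checkpoints

def spearman_rho(xs, ys):
    """Spearman rank correlation (assumes no tied values, for simplicity)."""
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0] * len(v)
        for rank, i in enumerate(order):
            r[i] = rank
        return r
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

# Hypothetical per-task accuracy curves over checkpoints for two models.
curves_a = {"retrieval": [0.1, 0.6, 0.9],
            "coref":     [0.0, 0.2, 0.7],
            "math":      [0.0, 0.1, 0.2]}
curves_b = {"retrieval": [0.2, 0.7, 0.95],
            "coref":     [0.1, 0.3, 0.8],
            "math":      [0.0, 0.0, 0.1]}

tasks = sorted(curves_a)
steps_a = [emergence_step(curves_a[t]) for t in tasks]
steps_b = [emergence_step(curves_b[t]) for t in tasks]
rho = spearman_rho(steps_a, steps_b)  # agreement of emergence orderings
```

In the paper's setting this pairwise ρ would be averaged over all 45 pairs drawn from the evaluated checkpoints of the four model families; with real data one would typically use `scipy.stats.spearmanr`, which also handles ties.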