Weight Decay Improves Language Model Plasticity
February 11, 2026
Authors: Tessa Han, Sebastian Bordt, Hanlin Zhang, Sham Kakade
cs.AI
Abstract
The prevailing paradigm in large language model (LLM) development is to pretrain a base model, then perform further training to improve performance and model behavior. However, hyperparameter optimization and scaling laws have been studied primarily from the perspective of the base model's validation loss, ignoring downstream adaptability. In this work, we study pretraining from the perspective of model plasticity, that is, the ability of the base model to successfully adapt to downstream tasks through fine-tuning. We focus on the role of weight decay, a key regularization parameter during pretraining. Through systematic experiments, we show that models trained with larger weight decay values are more plastic, meaning they show larger performance gains when fine-tuned on downstream tasks. This phenomenon can lead to counterintuitive trade-offs where base models that perform worse after pretraining can perform better after fine-tuning. Further investigation of weight decay's mechanistic effects on model behavior reveals that it encourages linearly separable representations, regularizes attention matrices, and reduces overfitting on the training data. In conclusion, this work demonstrates the importance of using evaluation metrics beyond cross-entropy loss for hyperparameter optimization and sheds light on the multifaceted role that a single optimization hyperparameter plays in shaping model behavior.
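For context on the hyperparameter being studied: in modern LLM pretraining, weight decay is typically applied through the optimizer, for example as AdamW's decoupled `weight_decay` argument. The sketch below is a minimal illustration of how that knob is set, assuming a PyTorch setup with a stand-in model and illustrative values; it is not the authors' actual training configuration, and the practice of excluding biases and LayerNorm parameters from decay is a common convention assumed here, not something stated in the abstract.

```python
# Minimal sketch: setting the weight decay hyperparameter during pretraining.
# Assumptions: PyTorch + AdamW; the model and all numeric values are illustrative.
import torch
from torch import nn

model = nn.TransformerEncoderLayer(d_model=256, nhead=4)  # stand-in for an LLM

# Common convention (an assumption here): apply decay only to weight matrices,
# not to biases or LayerNorm parameters (which have ndim < 2).
decay_params, no_decay_params = [], []
for name, param in model.named_parameters():
    (no_decay_params if param.ndim < 2 else decay_params).append(param)

optimizer = torch.optim.AdamW(
    [
        # A larger value here corresponds to the "higher plasticity" regime
        # the abstract describes; 0.1 vs. a smaller baseline like 0.01.
        {"params": decay_params, "weight_decay": 0.1},
        {"params": no_decay_params, "weight_decay": 0.0},
    ],
    lr=3e-4,
)
```

In this sketch, the only quantity varied across runs would be the `weight_decay` value in the first parameter group, keeping the rest of the pretraining recipe fixed, which mirrors the kind of controlled comparison the abstract describes.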