
Weight Decay Improves Language Model Plasticity

February 11, 2026
Authors: Tessa Han, Sebastian Bordt, Hanlin Zhang, Sham Kakade
cs.AI

Abstract

The prevailing paradigm in large language model (LLM) development is to pretrain a base model, then perform further training to improve performance and model behavior. However, hyperparameter optimization and scaling laws have been studied primarily from the perspective of the base model's validation loss, ignoring downstream adaptability. In this work, we study pretraining from the perspective of model plasticity, that is, the ability of the base model to successfully adapt to downstream tasks through fine-tuning. We focus on the role of weight decay, a key regularization parameter during pretraining. Through systematic experiments, we show that models trained with larger weight decay values are more plastic, meaning they show larger performance gains when fine-tuned on downstream tasks. This phenomenon can lead to counterintuitive trade-offs where base models that perform worse after pretraining can perform better after fine-tuning. Further investigation of weight decay's mechanistic effects on model behavior reveals that it encourages linearly separable representations, regularizes attention matrices, and reduces overfitting on the training data. In conclusion, this work demonstrates the importance of using evaluation metrics beyond cross-entropy loss for hyperparameter optimization and casts light on the multifaceted role that a single optimization hyperparameter plays in shaping model behavior.
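To make the hyperparameter under study concrete, the sketch below shows how a pretraining run typically sets the decoupled weight decay coefficient in PyTorch's AdamW optimizer. The model, learning rate, and weight decay value here are illustrative assumptions, not the paper's actual configuration.

```python
# Minimal sketch (illustrative, not the paper's code): a pretraining setup
# where only the AdamW weight decay coefficient is varied across runs.
import torch
from torch import nn

# Stand-in module for an LLM; the paper's architecture is not assumed here.
model = nn.TransformerEncoderLayer(d_model=512, nhead=8)

# Decoupled weight decay in AdamW is the regularizer the abstract refers to.
# A larger value (e.g. 0.1 rather than 0.0) is the setting associated with
# higher plasticity, i.e. larger gains after downstream fine-tuning.
optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=3e-4,           # hypothetical learning rate
    weight_decay=0.1,  # hypothetical value; the paper sweeps this hyperparameter
)
```

In AdamW, weight decay is applied directly to the parameters rather than folded into the gradient update, which is why it is commonly treated as an independent regularization knob in LLM pretraining sweeps.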