Cautious Weight Decay

October 14, 2025
Authors: Lizhang Chen, Jonathan Li, Kaizhao Liang, Baiyu Su, Cong Xie, Nuo Wang Pierse, Chen Liang, Ni Lao, Qiang Liu
cs.AI

Abstract

We introduce Cautious Weight Decay (CWD), a one-line, optimizer-agnostic modification that applies weight decay only to parameter coordinates whose signs align with the optimizer update. Unlike standard decoupled decay, which implicitly optimizes a regularized or constrained objective, CWD preserves the original loss and admits a bilevel interpretation: it induces sliding-mode behavior upon reaching the stationary manifold, allowing it to search for locally Pareto-optimal stationary points of the unmodified objective. In practice, CWD is a drop-in change for optimizers such as AdamW, Lion, and Muon, requiring no new hyperparameters or additional tuning. For language model pre-training and ImageNet classification, CWD consistently improves final loss and accuracy at million- to billion-parameter scales.
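
To make the "one-line" nature of the change concrete, the sketch below shows one plausible reading of the masking rule described above, applied to a generic decoupled-weight-decay step in PyTorch. The function name `cautious_weight_decay_step` and the exact sign convention of the mask are illustrative assumptions, not the paper's reference implementation.

```python
# Minimal sketch of sign-masked ("cautious") decoupled weight decay, assuming the
# mask keeps decay only on coordinates where the parameter's sign matches the
# sign of the optimizer update. Illustrative reading of the abstract, not the
# authors' code; the paper's exact sign convention may differ.
import torch

def cautious_weight_decay_step(param: torch.Tensor,
                               update: torch.Tensor,
                               lr: float,
                               weight_decay: float) -> None:
    """One parameter update with cautious (sign-masked) decoupled weight decay.

    `update` is whatever the base optimizer would subtract from `param`
    (e.g., AdamW's moment ratio, Lion's sign update, or Muon's orthogonalized step).
    """
    # Keep decay only where the parameter coordinate and the update agree in sign.
    mask = (torch.sign(param) == torch.sign(update)).to(param.dtype)
    # Standard decoupled decay would use `param` here; CWD multiplies it by the mask.
    param.add_(update + weight_decay * mask * param, alpha=-lr)
```

Dropping the mask recovers standard decoupled weight decay, which is consistent with the claim that CWD introduces no new hyperparameters or extra tuning.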