Cautious Weight Decay

October 14, 2025
Authors: Lizhang Chen, Jonathan Li, Kaizhao Liang, Baiyu Su, Cong Xie, Nuo Wang Pierse, Chen Liang, Ni Lao, Qiang Liu
cs.AI

Abstract

We introduce Cautious Weight Decay (CWD), a one-line, optimizer-agnostic modification that applies weight decay only to parameter coordinates whose signs align with the optimizer update. Unlike standard decoupled decay, which implicitly optimizes a regularized or constrained objective, CWD preserves the original loss and admits a bilevel interpretation: it induces sliding-mode behavior upon reaching the stationary manifold, allowing it to search for locally Pareto-optimal stationary points of the unmodified objective. In practice, CWD is a drop-in change for optimizers such as AdamW, Lion, and Muon, requiring no new hyperparameters or additional tuning. For language model pre-training and ImageNet classification, CWD consistently improves final loss and accuracy at million- to billion-parameter scales.