On Surprising Effectiveness of Masking Updates in Adaptive Optimizers
February 17, 2026
Authors: Taejong Joo, Wenhan Xia, Cheolmin Kim, Ming Zhang, Eugene Ie
cs.AI
Abstract
Training large language models (LLMs) relies almost exclusively on dense adaptive optimizers with increasingly sophisticated preconditioners. We challenge this by showing that randomly masking parameter updates can be highly effective, with a masked variant of RMSProp consistently outperforming recent state-of-the-art optimizers. Our analysis reveals that the random masking induces a curvature-dependent geometric regularization that smooths the optimization trajectory. Motivated by this finding, we introduce Momentum-aligned gradient masking (Magma), which modulates the masked updates using momentum-gradient alignment. Extensive LLM pre-training experiments show that Magma is a simple drop-in replacement for adaptive optimizers with consistent gains and negligible computational overhead. Notably, for the 1B model size, Magma reduces perplexity by over 19% and 9% compared to Adam and Muon, respectively.
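To make the two ideas named in the abstract concrete, here is a minimal, hedged sketch: (1) an RMSProp step whose coordinate updates are zeroed by a random mask, and (2) a Magma-style step where the keep-probability of each coordinate is modulated by momentum-gradient alignment. The Bernoulli masking, the sigmoid alignment rule, and names such as keep_prob and align are illustrative assumptions, not the authors' reference implementation or API.

import torch

@torch.no_grad()
def masked_rmsprop_step(p, grad, state, lr=1e-3, beta2=0.99, eps=1e-8, keep_prob=0.5):
    """RMSProp update with a random coordinate mask (assumed Bernoulli masking)."""
    v = state.setdefault("v", torch.zeros_like(p))
    v.mul_(beta2).addcmul_(grad, grad, value=1 - beta2)   # second-moment EMA
    update = grad / (v.sqrt() + eps)                      # preconditioned step
    mask = (torch.rand_like(p) < keep_prob).to(p.dtype)   # random update mask
    p.add_(mask * update, alpha=-lr)

@torch.no_grad()
def magma_style_step(p, grad, state, lr=1e-3, beta1=0.9, beta2=0.99, eps=1e-8):
    """Illustrative masking modulated by momentum-gradient alignment (assumed form)."""
    m = state.setdefault("m", torch.zeros_like(p))
    v = state.setdefault("v", torch.zeros_like(p))
    m.mul_(beta1).add_(grad, alpha=1 - beta1)             # first-moment (momentum) EMA
    v.mul_(beta2).addcmul_(grad, grad, value=1 - beta2)   # second-moment EMA
    # Assumed alignment signal: coordinates where momentum and gradient agree
    # get a higher keep-probability than coordinates where they disagree.
    align = torch.sigmoid(m * grad / (m.abs() * grad.abs() + eps))
    mask = (torch.rand_like(p) < align).to(p.dtype)
    p.add_(mask * (m / (v.sqrt() + eps)), alpha=-lr)

The sketch only illustrates the mechanism described in the abstract, i.e. masking applied to the update rather than to the gradient accumulation; how the paper actually schedules the mask or defines the alignment term is not specified here.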