On Surprising Effectiveness of Masking Updates in Adaptive Optimizers
February 17, 2026
Authors: Taejong Joo, Wenhan Xia, Cheolmin Kim, Ming Zhang, Eugene Ie
cs.AI
Abstract
Training large language models (LLMs) relies almost exclusively on dense adaptive optimizers with increasingly sophisticated preconditioners. We challenge this by showing that randomly masking parameter updates can be highly effective, with a masked variant of RMSProp consistently outperforming recent state-of-the-art optimizers. Our analysis reveals that the random masking induces a curvature-dependent geometric regularization that smooths the optimization trajectory. Motivated by this finding, we introduce Momentum-aligned gradient masking (Magma), which modulates the masked updates using momentum-gradient alignment. Extensive LLM pre-training experiments show that Magma is a simple drop-in replacement for adaptive optimizers with consistent gains and negligible computational overhead. Notably, for the 1B model size, Magma reduces perplexity by over 19% and 9% compared to Adam and Muon, respectively.
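To make the two ideas in the abstract concrete, here is a minimal NumPy sketch of (a) an RMSProp step whose per-coordinate updates are randomly masked and (b) a Magma-style step that gates updates by momentum-gradient alignment. The abstract does not specify the exact update rules, hyperparameters, or the form of the alignment mechanism, so the keep probability, the sign-agreement gate, and all function names below are illustrative assumptions rather than the authors' algorithm.

```python
# Illustrative sketch only: masked RMSProp and a guessed "momentum-aligned"
# gating rule. Not the paper's exact method.

import numpy as np

rng = np.random.default_rng(0)


def masked_rmsprop_step(param, grad, v, lr=1e-3, beta2=0.999,
                        eps=1e-8, mask_prob=0.5):
    """One RMSProp step where only a random subset of coordinates is updated.

    `mask_prob` (assumed) is the probability of keeping each coordinate's update.
    """
    # Standard RMSProp second-moment accumulator.
    v = beta2 * v + (1.0 - beta2) * grad ** 2
    update = grad / (np.sqrt(v) + eps)

    # Random binary mask: only kept coordinates move this step.
    mask = (rng.random(param.shape) < mask_prob).astype(param.dtype)
    param = param - lr * mask * update
    return param, v


def magma_like_step(param, grad, v, m, lr=1e-3, beta1=0.9, beta2=0.999,
                    eps=1e-8):
    """A guess at momentum-aligned gradient masking: keep a coordinate when its
    gradient agrees in sign with the momentum, drop it otherwise. The
    sign-agreement gate is an assumption standing in for the paper's
    momentum-gradient alignment mechanism.
    """
    m = beta1 * m + (1.0 - beta1) * grad
    v = beta2 * v + (1.0 - beta2) * grad ** 2
    update = m / (np.sqrt(v) + eps)

    # Alignment gate: gradient and momentum point the same way.
    mask = (grad * m > 0).astype(param.dtype)
    param = param - lr * mask * update
    return param, v, m


if __name__ == "__main__":
    # Toy quadratic objective 0.5 * ||w||^2, whose gradient is w itself.
    w = rng.normal(size=8)
    v = np.zeros_like(w)
    m = np.zeros_like(w)
    for _ in range(100):
        g = w
        w, v, m = magma_like_step(w, g, v, m, lr=0.1)
    print("final ||w||:", np.linalg.norm(w))
```

In practice the same masking pattern would be applied inside an optimizer class (e.g., a `torch.optim.Optimizer` subclass) so it can serve as the drop-in replacement the abstract describes; the toy loop above only demonstrates that the masked update still makes progress on a simple objective.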