MARS-M: When Variance Reduction Meets Matrices
October 20, 2025
Authors: Yifeng Liu, Angela Yuan, Quanquan Gu
cs.AI
Abstract
Matrix-based preconditioned optimizers, such as Muon, have recently been
shown to be more efficient than scalar-based optimizers for training
large-scale neural networks, including large language models (LLMs). On the
other hand, recent benchmarks on optimizers for LLM pre-training have
demonstrated that variance-reduction techniques such as MARS can achieve
substantial speedups over standard optimizers that do not employ variance
reduction. In this paper, to achieve the best of both worlds, we introduce
MARS-M, a new optimizer that integrates the variance reduction technique in
MARS with Muon. Under standard regularity conditions, we prove that MARS-M
converges to a first-order stationary point at a rate of
\(\mathcal{O}(T^{-1/3})\), which improves upon the \(\mathcal{O}(T^{-1/4})\)
rate attained by Muon. Our empirical results on
language modeling and computer vision tasks demonstrate that MARS-M
consistently yields lower losses and improved performance across various
downstream benchmarks. The implementation of MARS-M is available at
https://github.com/AGI-Arena/MARS/MARS_M.
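
To make the combination concrete, below is a minimal PyTorch sketch of one MARS-M-style update on a single 2-D weight matrix. It is an illustration under stated assumptions, not the reference implementation from the repository above: the function names (newton_schulz, mars_m_step), the hyperparameter values, the clipping threshold, and the reuse of the previous step's stored gradient (the cheaper "approximate" form of the MARS correction) are simplifications; the Newton-Schulz coefficients follow the publicly available Muon reference code.

```python
import torch

def newton_schulz(G, steps=5, eps=1e-7):
    """Approximately orthogonalize G (map it toward the U V^T factor of its
    SVD) via the quintic Newton-Schulz iteration used by Muon."""
    a, b, c = 3.4445, -4.7750, 2.0315   # coefficients from the Muon reference code
    X = G.float()                       # (the Muon reference runs this in bfloat16)
    if X.size(0) > X.size(1):           # iterate on the "wide" orientation
        X = X.T
    X = X / (X.norm() + eps)            # scale so the iteration is stable
    for _ in range(steps):
        A = X @ X.T
        B = b * A + c * A @ A
        X = a * X + B @ X
    if G.size(0) > G.size(1):
        X = X.T
    return X.to(G.dtype)

def mars_m_step(x, grad, prev_grad, m, lr=3e-4, beta=0.95, gamma=0.025):
    """One illustrative MARS-M update for a 2-D parameter `x`:
    variance-reduced correction -> clipping -> momentum -> orthogonalized step."""
    # MARS-style correction: add a scaled difference of consecutive stochastic
    # gradients to reduce the variance of the gradient estimate.
    c_t = grad + gamma * (beta / (1.0 - beta)) * (grad - prev_grad)
    norm = c_t.norm()
    if norm > 1.0:                      # clip the corrected gradient, as in MARS
        c_t = c_t / norm
    m = beta * m + (1.0 - beta) * c_t   # exponential-moving-average momentum
    x = x - lr * newton_schulz(m)       # Muon-style matrix preconditioning
    return x, m

# Toy usage: drive a random matrix toward a target for a few steps.
torch.manual_seed(0)
w = torch.randn(64, 32)
target = torch.randn(64, 32)
m = torch.zeros_like(w)
prev_grad = torch.zeros_like(w)
for step in range(10):
    grad = 2.0 * (w - target)           # gradient of ||w - target||_F^2
    w, m = mars_m_step(w, grad, prev_grad, m)
    prev_grad = grad
```

The sketch highlights the design point in the abstract: the variance-reduction step operates on the raw gradient before momentum is accumulated, while the matrix preconditioning (orthogonalization) is applied only at the very end, to the momentum buffer, so the two techniques compose without interfering.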