MARS-M: When Variance Reduction Meets Matrices
October 20, 2025
Authors: Yifeng Liu, Angela Yuan, Quanquan Gu
cs.AI
Abstract
Matrix-based preconditioned optimizers, such as Muon, have recently been
shown to be more efficient than scalar-based optimizers for training
large-scale neural networks, including large language models (LLMs). On the
other hand, recent benchmarks on optimizers for LLM pre-training have
demonstrated that variance-reduction techniques such as MARS can achieve
substantial speedups over standard optimizers that do not employ variance
reduction. In this paper, to achieve the best of both worlds, we introduce
MARS-M, a new optimizer that integrates the variance reduction technique in
MARS with Muon. Under standard regularity conditions, we prove that MARS-M
converges to a first-order stationary point at a rate of
\(\mathcal{O}(T^{-1/3})\), which improves upon the \(\mathcal{O}(T^{-1/4})\)
rate attained by Muon. Our empirical results on
language modeling and computer vision tasks demonstrate that MARS-M
consistently yields lower losses and improved performance across various
downstream benchmarks. The implementation of MARS-M is available at
https://github.com/AGI-Arena/MARS/MARS_M.
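
To make the combination concrete, below is a minimal PyTorch sketch of one MARS-M-style update on a single 2-D weight matrix. It is an illustration under stated assumptions, not the reference implementation from the repository above: the function names (newton_schulz, mars_m_step), the hyperparameter values, the clipping threshold, and the reuse of the previous step's stored gradient (the cheaper "approximate" form of the MARS correction) are simplifications; the Newton-Schulz coefficients follow the publicly available Muon reference code.

```python
import torch

def newton_schulz(G, steps=5, eps=1e-7):
    """Approximately orthogonalize G (map it toward the U V^T factor of its
    SVD) via the quintic Newton-Schulz iteration used by Muon."""
    a, b, c = 3.4445, -4.7750, 2.0315   # coefficients from the Muon reference code
    X = G.float()                       # (the Muon reference runs this in bfloat16)
    if X.size(0) > X.size(1):           # iterate on the "wide" orientation
        X = X.T
    X = X / (X.norm() + eps)            # scale so the iteration is stable
    for _ in range(steps):
        A = X @ X.T
        B = b * A + c * A @ A
        X = a * X + B @ X
    if G.size(0) > G.size(1):
        X = X.T
    return X.to(G.dtype)

def mars_m_step(x, grad, prev_grad, m, lr=3e-4, beta=0.95, gamma=0.025):
    """One illustrative MARS-M update for a 2-D parameter `x`:
    variance-reduced correction -> clipping -> momentum -> orthogonalized step."""
    # MARS-style correction: add a scaled difference of consecutive stochastic
    # gradients to reduce the variance of the gradient estimate.
    c_t = grad + gamma * (beta / (1.0 - beta)) * (grad - prev_grad)
    norm = c_t.norm()
    if norm > 1.0:                      # clip the corrected gradient, as in MARS
        c_t = c_t / norm
    m = beta * m + (1.0 - beta) * c_t   # exponential-moving-average momentum
    x = x - lr * newton_schulz(m)       # Muon-style matrix preconditioning
    return x, m

# Toy usage: drive a random matrix toward a target for a few steps.
torch.manual_seed(0)
w = torch.randn(64, 32)
target = torch.randn(64, 32)
m = torch.zeros_like(w)
prev_grad = torch.zeros_like(w)
for step in range(10):
    grad = 2.0 * (w - target)           # gradient of ||w - target||_F^2
    w, m = mars_m_step(w, grad, prev_grad, m)
    prev_grad = grad
```

The sketch highlights the design point in the abstract: the variance-reduction step operates on the raw gradient before momentum is accumulated, while the matrix preconditioning (orthogonalization) is applied only at the very end, to the momentum buffer, so the two techniques compose without interfering.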