MARS-M: When Variance Reduction Meets Matrices

Abstract

Matrix-based preconditioned optimizers such as Muon have recently been shown to be more efficient than scalar-based optimizers for training large-scale neural networks, including large language models (LLMs). Meanwhile, recent benchmarks of optimizers for LLM pre-training have demonstrated that variance-reduction techniques such as MARS can achieve substantial speedups over standard optimizers that do not employ variance reduction. In this paper, to achieve the best of both worlds, we introduce MARS-M, a new optimizer that integrates the variance-reduction technique of MARS with Muon. Under standard regularity conditions, we prove that MARS-M converges to a first-order stationary point at a rate of $\mathcal{O}(T^{-1/3})$, which improves upon the $\mathcal{O}(T^{-1/4})$ rate attained by Muon. Our empirical results on language modeling and computer vision tasks demonstrate that MARS-M consistently yields lower losses and improved performance across various downstream benchmarks. The implementation of MARS-M is available at https://github.com/AGI-Arena/MARS/MARS_M.
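To make the gap between the two stated rates concrete, the short sketch below computes how many iterations each bound would need to reach a given tolerance. It is an illustration only: the helper `iterations_needed` is hypothetical, and it assumes the constant hidden by the big-O notation equals 1, which the abstract does not state.

```python
def iterations_needed(inverse_tolerance: int, rate_root: int) -> int:
    """Smallest T with T ** (-1 / rate_root) <= 1 / inverse_tolerance.

    Assumption (illustrative only): the bound is exactly T^(-1/rate_root),
    i.e. the constant hidden by the O(.) notation is taken to be 1.
    """
    # T^(-1/p) <= 1/k is equivalent to T >= k^p, so integer arithmetic suffices.
    return inverse_tolerance ** rate_root


if __name__ == "__main__":
    for k in (10, 100):
        mars_m = iterations_needed(k, 3)  # O(T^{-1/3}) rate stated for MARS-M
        muon = iterations_needed(k, 4)    # O(T^{-1/4}) rate stated for Muon
        print(f"tolerance 1/{k}: MARS-M bound {mars_m:,} steps, Muon bound {muon:,} steps")
```

Under this idealization the ratio of the two iteration counts equals the inverse tolerance, so the advantage of the faster rate grows as the target tightens. These are worst-case bounds, not predictions of measured training time.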
