MARS: Unleashing the Power of Variance Reduction for Training Large Models

Abstract

Training deep neural networks, and more recently large models, demands efficient and scalable optimizers. Adaptive gradient algorithms like Adam, AdamW, and their variants have been central to this task. Despite the development of numerous variance reduction algorithms in the past decade aimed at accelerating stochastic optimization in both convex and nonconvex settings, variance reduction has not found widespread success in training deep neural networks or large language models. Consequently, it has remained a less favored approach in modern AI. In this paper, to unleash the power of variance reduction for efficient training of large models, we propose a unified optimization framework, MARS (Make vAriance Reduction Shine), which reconciles preconditioned gradient methods with variance reduction via a scaled stochastic recursive momentum technique. Within our framework, we introduce three instances of MARS that leverage preconditioned gradient updates based on AdamW, Lion, and Shampoo, respectively. We also draw a connection between our algorithms and existing optimizers. Experimental results on training GPT-2 models indicate that MARS consistently outperforms AdamW by a large margin.
