MSign: An Optimizer Preventing Training Instability in Large Language Models via Stable Rank Restoration

Abstract

Training instability remains a critical challenge in large language model (LLM) pretraining, often manifesting as sudden gradient explosions that waste significant computational resources. We study training failures in a 5M-parameter NanoGPT model scaled via μP and identify two key phenomena that precede collapse: (1) a rapid decline in the stable rank of weight matrices (the ratio of the squared Frobenius norm to the squared spectral norm), and (2) increasing alignment between the Jacobians of adjacent layers. We prove theoretically that these two conditions jointly cause the gradient norm to grow exponentially with network depth. To break this instability mechanism, we propose MSign, a new optimizer that periodically applies matrix sign operations to restore stable rank. Experiments on models ranging from 5M to 3B parameters demonstrate that MSign effectively prevents training failures with a computational overhead of less than 7.0%.
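As a rough illustration of the two quantities named in the abstract, the following NumPy sketch computes the stable rank of a matrix and applies a matrix sign operation to it. It assumes that the matrix sign is the polar factor UV^T obtained from the SVD W = UΣV^T, which the abstract does not spell out. The function names, the example matrix, and the Frobenius-norm rescaling are hypothetical choices made for illustration and are not taken from the paper; the periodic schedule inside an optimizer is left out.

```python
import numpy as np


def stable_rank(w: np.ndarray) -> float:
    """Squared Frobenius norm divided by squared spectral norm."""
    singular_values = np.linalg.svd(w, compute_uv=False)
    return float(np.sum(singular_values**2) / singular_values[0] ** 2)


def matrix_sign(w: np.ndarray) -> np.ndarray:
    """Polar factor U @ V^T (assumed definition): nonzero singular values become 1."""
    u, _, vt = np.linalg.svd(w, full_matrices=False)
    return u @ vt


# Hypothetical weight matrix dominated by a single direction (low stable rank).
rng = np.random.default_rng(0)
w = 10.0 * np.outer(rng.standard_normal(8), rng.standard_normal(6))
w += 0.1 * rng.standard_normal((8, 6))

restored = matrix_sign(w)
# Rescale to the original Frobenius norm (an illustrative choice, not from the paper).
restored *= np.linalg.norm(w) / np.linalg.norm(restored)

print(f"stable rank before: {stable_rank(w):.2f}")
print(f"stable rank after:  {stable_rank(restored):.2f}")
```

Because this form of the matrix sign gives every nonzero singular value the same magnitude, the stable rank of a full-rank matrix rises to its rank (6 for the 8-by-6 example), whereas the input, dominated by one direction, starts close to 1.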
