MSign: An Optimizer Preventing Training Instability in Large Language Models via Stable Rank Restoration
February 2, 2026
Authors: Lianhai Ren, Yucheng Ding, Xiao Liu, Qianxiao Li, Peng Cheng, Yeyun Gong
cs.AI
Abstract
Training instability remains a critical challenge in large language model (LLM) pretraining, often manifesting as sudden gradient explosions that waste significant computational resources. We study training failures in a 5M-parameter NanoGPT model scaled via μP, identifying two key phenomena that precede collapse: (1) a rapid decline in the stable rank of weight matrices (the ratio of the squared Frobenius norm to the squared spectral norm), and (2) increasing alignment between the Jacobians of adjacent layers. We prove theoretically that these two conditions jointly cause gradient norms to grow exponentially with network depth. To break this instability mechanism, we propose MSign, an optimizer that periodically applies the matrix sign operation to restore stable rank. Experiments on models from 5M to 3B parameters demonstrate that MSign effectively prevents training failures with a computational overhead of less than 7.0%.
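The abstract defines stable rank as the squared Frobenius norm over the squared spectral norm, and describes MSign as periodically applying a matrix sign operation to restore it; it does not specify how that operation is computed or scheduled. Below is a minimal NumPy sketch of these two quantities, assuming the common SVD-based matrix sign for rectangular matrices (the orthogonal polar factor UVᵀ, which maps every nonzero singular value to 1 and thus restores full stable rank). The names `stable_rank` and `msign` are illustrative, not the paper's API.

```python
import numpy as np

def stable_rank(W):
    """Stable rank: ||W||_F^2 / ||W||_2^2."""
    s = np.linalg.svd(W, compute_uv=False)  # singular values, descending
    return (s ** 2).sum() / s[0] ** 2

def msign(W):
    """Matrix sign for a (possibly rectangular) matrix: the orthogonal
    polar factor U V^T, which sets every nonzero singular value to 1."""
    U, _, Vt = np.linalg.svd(W, full_matrices=False)
    return U @ Vt

# Illustration: a near-rank-1 weight matrix has stable rank close to 1
# (the collapse signature described in the abstract); applying msign
# restores it to the full dimension.
rng = np.random.default_rng(0)
W = np.outer(rng.standard_normal(64), rng.standard_normal(64))
W += 0.01 * rng.standard_normal((64, 64))  # small full-rank perturbation
print(stable_rank(W))         # close to 1
print(stable_rank(msign(W)))  # close to 64
```

How the operation is interleaved with the base optimizer (frequency, which layers, any mixing with the current weights) is a design detail of MSign that the abstract does not describe.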